Unbalanced distribution of multi-classes, how can I divide training/testing set

Question

Dear statistics experts,

I am a newbie student in the machine learning field.

I have just started a job classifying a set of scientific abstracts into five classes.

The number of texts per class is as follows:

Class1: 200
Class2: 950
Class3: 150
Class4: 100
Class5: 350

I am planning to build a multi-class classifier, but I am worried about the imbalance in the number of texts per class. For example, if I use 100 documents from each class for training, Class2 ends up with relatively far too much testing data.

I would appreciate some insight into how to construct my training/testing sets, and the reasoning behind it.

Sincerely yours,

Compute the distribution of the classes, for example, c1 = 200/(200+950+150+100+350) ~ 0.11, c2 = 950/(200+950+150+100+350) ~ 0.54, and so on. If you want to make a 70%/30% split, then you randomly select 0.11 * 0.3 * 200 of the elements belonging to class 1 to go to the test set. The rest go to the training set. Repeat for each class. That way the distribution of the classes in the training and test sets is similar. — Fabian Werner, May 08 '18 at 09:04
@FabianWerner Thank you for your attention :) Did you mean that the bias problem is relieved if the class ratios of the whole data set are maintained in the training and testing sets? — W Lee, May 09 '18 at 06:40
@FabianWerner 0.11 * 0.3 * 200 = 6.6, is that right? If the main point is to maintain the population distribution of each class, is it reasonable to just divide each class by the same partition ratio, such as 0.7/0.3? — W Lee, May 09 '18 at 07:20
Depends on what you mean by 'bias problem'. If you mean "bias problem = classes are unequally distributed" then no, you will not solve this by sampling the training and test sets correctly. You have to adapt your model in order to 'solve' this problem (so that misclassification of the rare class(es) is punished much more than misclassification of the overwhelming class(es)). What I suggested just makes it possible for you to see which model is capable of doing this and which is not. On your second comment: you are absolutely right, and if you have millions of data points then what you say — Fabian Werner, May 09 '18 at 10:49
works equally well. However, since you have only a few hundred, randomness in the sampling might cause the distribution of the classes in your test set to be skewed... I also see that I got it wrong: you just select 0.3 * 200 of the first class, 0.3 * 950 of the second class and so on and let these go into the test set, not 0.11 * 0.3 * 200, sorry for that. — Fabian Werner, May 09 '18 at 10:51
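The corrected recipe in the comment above (send 30% of each class to the test set, chosen at random, and the rest to training) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratified_split`, the fixed seed, and the use of placeholder labels instead of real abstracts are not from the thread.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices), sampling test_fraction of each class.

    Hypothetical helper for illustration; not part of any library.
    """
    rng = random.Random(seed)

    # Group document indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # random selection within the class
        n_test = round(test_fraction * len(indices))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Class sizes taken from the question; the labels stand in for real documents.
class_sizes = {"Class1": 200, "Class2": 950, "Class3": 150, "Class4": 100, "Class5": 350}
labels = [name for name, size in class_sizes.items() for _ in range(size)]

train_idx, test_idx = stratified_split(labels, test_fraction=0.3)
test_counts = {name: sum(labels[i] == name for i in test_idx) for name in class_sizes}
print(test_counts)
# {'Class1': 60, 'Class2': 285, 'Class3': 45, 'Class4': 30, 'Class5': 105}
```

Each class contributes exactly 30% of its documents to the test set, so the class proportions in both sets match those of the full data. If scikit-learn is available, `train_test_split` with its `stratify` argument should give an equivalent split.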
@FabianWerner Thank you very much for your helpful comments. — W Lee, May 10 '18 at 03:40
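The comments mention punishing misclassification of the rare classes more heavily than that of the dominant class, without saying how. One common heuristic, assumed here and not necessarily what the commenter had in mind, is to weight each class by the inverse of its frequency, w_c = N / (K * n_c), where N is the total number of documents, K the number of classes, and n_c the size of class c. A minimal sketch with the class sizes from the question:

```python
# Class sizes taken from the question.
class_sizes = {"Class1": 200, "Class2": 950, "Class3": 150, "Class4": 100, "Class5": 350}

total = sum(class_sizes.values())   # N = 1750
n_classes = len(class_sizes)        # K = 5

# Inverse-frequency weights (an assumed heuristic): rare classes get
# weights above 1, the dominant class gets a weight below 1.
class_weights = {
    name: total / (n_classes * size) for name, size in class_sizes.items()
}

for name, weight in class_weights.items():
    print(f"{name}: {weight:.2f}")
```

With these counts, an error on a Class4 document would cost about 3.5 times the baseline, while an error on a Class2 document would cost about 0.37 times. As far as I know, this is the same formula scikit-learn applies when an estimator is given `class_weight="balanced"`.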
