I'm a beginner in machine learning and I'm facing the following situation. I'm working on a Real Time Bidding problem with the IPinYou dataset, and I'm trying to do click prediction.
The thing is that, as you may know, the dataset is very imbalanced: around 1,300 negative examples (non-clicks) for every positive example (click).
This is what I do:
- Load the data
- Split the dataset into 3 datasets: A = training (60%), B = validation (20%), C = testing (20%)
- For each dataset (A, B, C), under-sample the negative class in order to reach a ratio of 5 (5 negative examples for 1 positive example). This gives me 3 new, more balanced datasets: A', B', C'
Then I train a logistic regression model on dataset A'.
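For concreteness, here is a minimal sketch of this pipeline with scikit-learn. The `make_classification` call is just a stand-in for the real IPinYou features, and the `undersample` helper is my own; the 5:1 ratio is the one described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the real IPinYou features/labels (~0.1% positives)
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.999], random_state=0)

def undersample(X, y, ratio=5, seed=0):
    """Keep every positive and at most `ratio` randomly chosen negatives per positive."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = rng.choice(neg, size=min(len(neg), ratio * len(pos)), replace=False)
    idx = np.concatenate([pos, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

# A = 60% training, B = 20% validation, C = 20% testing
# (stratified splits keep the original ~1300:1 class ratio in each set)
X_a, X_rest, y_a, y_rest = train_test_split(X, y, test_size=0.4,
                                            stratify=y, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_rest, y_rest, test_size=0.5,
                                      stratify=y_rest, random_state=0)

# A' = under-sampled training set; B' and C' would be built the same way
X_a2, y_a2 = undersample(X_a, y_a, ratio=5)

model = LogisticRegression(max_iter=1000).fit(X_a2, y_a2)
```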
My questions are:
Which dataset should I use for validation? B or B'?
Which dataset should I use for testing? C or C'?
Which metrics are the most relevant to evaluate my model? The F1 score seems to be a widely used metric, but here, because of the class imbalance (if I use datasets B and C), precision is low (under 0.20) and the F1 score is strongly dragged down by the low precision/recall. Would it be more appropriate to use AUC-PR or AUC-ROC?
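For reference, this is roughly how I compute those three metrics on the untouched test set C (continuing from the sketch above, so `model`, `X_c` and `y_c` are the ones defined there):

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Score the model on the *untouched* test set C, not the under-sampled C'
proba = model.predict_proba(X_c)[:, 1]   # predicted P(click)
pred = (proba >= 0.5).astype(int)        # default 0.5 threshold

print("F1     :", f1_score(y_c, pred))
print("AUC-PR :", average_precision_score(y_c, proba))  # area under the PR curve
print("AUC-ROC:", roc_auc_score(y_c, proba))
```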
If I want to plot the learning curve, which metric should I use? (Knowing that the % error isn't relevant if I use the B' dataset for validation.)
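And this is the kind of learning curve I have in mind, scored with average precision since plain accuracy/% error is meaningless at a ~1300:1 ratio. Note that scikit-learn's `learning_curve` does its own cross-validation split rather than using my B/B', so this is only an illustration of the metric choice (again continuing from the variables above):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# learning_curve does its own internal CV split on the data it is given,
# so this illustrates the metric choice, not my exact A'/B' setup.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_a2, y_a2,
    scoring="average_precision", cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)

plt.plot(sizes, train_scores.mean(axis=1), label="training (A')")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation")
plt.xlabel("number of training examples")
plt.ylabel("average precision (AUC-PR)")
plt.legend()
plt.show()
```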
Thanks in advance for your time!
Regards.
- When you say "to train on A' and test on B'", do you mean validate?
- Regarding "generate learning curves for C" & "F1(C) score is under/similar to F1(B)": I thought that, for the learning curve, we only had to plot the error metric for the training set (A or A' here) and the error metric for the validation set (B or B'). Aren't you validating on C here? – jmvllt Nov 19 '15 at 09:49

Anyway, thanks for your time! – jmvllt Nov 19 '15 at 09:49