I'm a beginner in machine learning and I'm facing the following situation. I'm working on a Real Time Bidding problem with the IPinYou dataset, and I'm trying to do click prediction.
The thing is that, as you may know, the dataset is very imbalanced: around 1,300 negative examples (non-clicks) for every positive example (click).
This is what I do:
- Load the data
- Split the dataset into 3 datasets: A = training (60%), B = validation (20%), C = testing (20%)
- For each dataset (A, B, C), under-sample the negative class in order to reach a ratio of 5 (5 negative examples for 1 positive example). This gives me 3 new, more balanced datasets: A', B', C'
Then I train a logistic regression model on dataset A'.
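For concreteness, here is a minimal sketch of this pipeline with scikit-learn. The `make_classification` call is just a stand-in for the real IPinYou features, and the `undersample` helper is my own; the 5:1 ratio is the one described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for the real IPinYou features/labels (~0.1% positives)
X, y = make_classification(n_samples=100_000, n_features=20,
                           weights=[0.999], random_state=0)

def undersample(X, y, ratio=5, seed=0):
    """Keep every positive and at most `ratio` randomly chosen negatives per positive."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = rng.choice(neg, size=min(len(neg), ratio * len(pos)), replace=False)
    idx = np.concatenate([pos, keep])
    rng.shuffle(idx)
    return X[idx], y[idx]

# A = 60% training, B = 20% validation, C = 20% testing
# (stratified splits keep the original ~1300:1 class ratio in each set)
X_a, X_rest, y_a, y_rest = train_test_split(X, y, test_size=0.4,
                                            stratify=y, random_state=0)
X_b, X_c, y_b, y_c = train_test_split(X_rest, y_rest, test_size=0.5,
                                      stratify=y_rest, random_state=0)

# A' = under-sampled training set; B' and C' would be built the same way
X_a2, y_a2 = undersample(X_a, y_a, ratio=5)

model = LogisticRegression(max_iter=1000).fit(X_a2, y_a2)
```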
My questions are:
Which dataset should I use for validation? B or B'?
Which dataset should I use for testing? C or C'?
Which metrics are the most relevant to evaluate my model? The F1 score seems to be a widely used metric, but here, because of the class imbalance (if I use datasets B and C), precision is low (under 0.20) and the F1 score is strongly dragged down by the low precision/recall. Would it be more appropriate to use AUC-PR or AUC-ROC?
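For reference, this is roughly how I compute those three metrics on the untouched test set C (continuing from the sketch above, so `model`, `X_c` and `y_c` are the ones defined there):

```python
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

# Score the model on the *untouched* test set C, not the under-sampled C'
proba = model.predict_proba(X_c)[:, 1]   # predicted P(click)
pred = (proba >= 0.5).astype(int)        # default 0.5 threshold

print("F1     :", f1_score(y_c, pred))
print("AUC-PR :", average_precision_score(y_c, proba))  # area under the PR curve
print("AUC-ROC:", roc_auc_score(y_c, proba))
```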
If I want to plot the learning curve, which metric should I use? (Knowing that the % error isn't relevant if I use the B' dataset for validation.)
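And this is the kind of learning curve I have in mind, scored with average precision since plain accuracy/% error is meaningless at a ~1300:1 ratio. Note that scikit-learn's `learning_curve` does its own cross-validation split rather than using my B/B', so this is only an illustration of the metric choice (again continuing from the variables above):

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# learning_curve does its own internal CV split on the data it is given,
# so this illustrates the metric choice, not my exact A'/B' setup.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X_a2, y_a2,
    scoring="average_precision", cv=5,
    train_sizes=np.linspace(0.1, 1.0, 5),
)

plt.plot(sizes, train_scores.mean(axis=1), label="training (A')")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation")
plt.xlabel("number of training examples")
plt.ylabel("average precision (AUC-PR)")
plt.legend()
plt.show()
```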
Thanks in advance for your time!
Regards.
- When you say "to train on A' and test on B'", do you mean validate?
- Regarding "generate learning curves for C" & "F1(C) score is under/similar to F1(B)": I thought that, for the learning curve, we only had to plot the error metric for the training set (A or A' here) and the error metric for the validation set (B or B'). Aren't you validating on C here? – jmvllt Nov 19 '15 at 09:49

Anyway, thanks for your time! – jmvllt Nov 19 '15 at 09:49