
I have a highly imbalanced dataset (1000 vs 60) and I want to use upsampling. The real-life distribution of the problem (predicting no-shows) is probably also highly imbalanced. My question is two-fold:

1) I know that I should keep the distribution of my training and test sets as close as possible to the real-life setting, so that the distributions are identical. Given that, should I upsample at all?

2) If I should upsample, should I upsample both my training and test set, or only the training set? Imagine my training set is (750 vs 45) and my test set is (250 vs 15). Should I bring them to (750 vs 750) and (250 vs 250), or to (750 vs 750) and (250 vs 15)?

Any relevant papers concerning this problem are also welcome.

1 Answer


Part of the answer depends on whether you are using a probabilistic classifier or not: i.e., does the model produce an estimate of the probability of being class A, or just a hard decision, A or B?

1) If you use a (well-calibrated) probabilistic classifier (logistic regression, most neural nets, xgboost with log loss), then it probably doesn't matter too much: you can train on the upsampled data and readjust your predicted probabilities proportionately afterwards to match the real-life class prior. If you don't use a probabilistic classifier, then you have to rebalance, and beyond that it is less clear.
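To make the "readjust proportionately" part concrete: a model trained on upsampled data learns probabilities under the training prior, and you can map them back to the real-life prior by reweighting the odds. A minimal sketch, where the helper name `adjust_probability` and the example priors are illustrative, not from the answer:

```python
def adjust_probability(p_up, prior_train, prior_true):
    """Map a probability predicted by a model trained on resampled data
    (class prior = prior_train) back to the real-life prior (prior_true)
    by rescaling the odds by the ratio of class priors."""
    odds = (p_up / (1.0 - p_up)) * (
        (prior_true / prior_train) * ((1.0 - prior_train) / (1.0 - prior_true))
    )
    return odds / (1.0 + odds)

# Example: trained on a 50/50 upsampled set, real prior is 60/1060.
prior_true = 60 / 1060
p_adjusted = adjust_probability(0.5, 0.5, prior_true)
# A "coin-flip" prediction on balanced data maps back to the base rate,
# i.e. p_adjusted == prior_true.
```

The same formula works in reverse if you ever need to go the other way; it is just Bayes' rule applied to a shift in the class prior.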

2) Don't adjust your test set: evaluation should reflect the real-life distribution.

Beyond that, I suspect any differences between resampling strategies are random/problem-specific, i.e., suck it and see.
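Putting points 1) and 2) together, the recommended setup is a stratified split followed by upsampling of the training portion only. A sketch with synthetic data, where the counts (1000 vs 60) and the 75/25 split mirror the question but everything else is illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data matching the question's counts: 1000 majority vs 60 minority.
rng = np.random.RandomState(0)
X = rng.randn(1060, 5)
y = np.array([0] * 1000 + [1] * 60)

# Stratified split first, so the test set keeps the real-life imbalance
# (750 vs 45 for training, 250 vs 15 for test).
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Upsample the minority class in the TRAINING set only.
X_maj, y_maj = X_tr[y_tr == 0], y_tr[y_tr == 0]
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_min_up, y_min_up = resample(X_min, y_min,
                              replace=True,
                              n_samples=len(y_maj),
                              random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
# Training data is now 750 vs 750; the test set (X_te, y_te) is untouched.
```

This is the (750 vs 750) / (250 vs 15) option from the question; fit on `X_bal, y_bal` and evaluate on the unmodified `X_te, y_te`.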

seanv507