Is doing oversampling on train set and undersampling on test set correct?

Question

I have an imbalanced dataset (95% in class 0 and 5% in class 1) and I am using machine learning for classification. The AUC (area under the ROC curve) was high (about 0.86), but the AUPRC (area under the precision-recall curve) was very low (about 0.04) because of the imbalanced dataset.

I did oversampling on the training set, but both AUC and AUPRC on the test set became very poor (smaller than 0.1, although both were good on the training set). According to stratification, each class should have the same rate in the train and test sets, so I decided to do undersampling on the test set. Eventually, my model has AUC = 0.80 and AUPRC = 0.79.

I am not sure whether this is correct. Where did I make a mistake? Is it correct to do undersampling on the test set?

Is a different class rate between train and test harmful when oversampling, or is my model just not good?

When you say undersampling on the test set, do you mean throwing out samples of the minority class? You shouldn't use over/undersampling techniques on your test set. Split your data set first, then use only your training set to experiment. — Laksan Nathan, Oct 18 '19 at 13:41
Thanks for your comment. I mean that I eliminate some samples of the majority class in the test set, so that the test set has 40 samples belonging to class 0 and 40 samples belonging to class 1. Is this wrong? I've read that over/undersampling techniques are for the training set, not the test set, but in the test set I have only 40 samples of class 0 and 800 samples of class 1, which differs from the rates in the training set after oversampling. According to stratification, the rate of each class in the train and test sets should be equal. — user229019, Oct 18 '19 at 14:18

score 2 · Answer 1 · answered Apr 09 '23 at 19:25

Oversampling is largely a solution to a non-problem. Consequently, it is up for debate whether you should be doing this at all on your training set.

However, except for some particular situations (I give one below), fiddling with the data in the test set is a terrible idea. The idea of having a test set is to get an honest evaluation of how the model will perform in production. If you do this under conditions that are not representative of the real conditions, you are tricking yourself into believing your model is better than it really is or not as good as it is. Using a model for real that is not as good as you think is bad for business because the performance will turn out to be subpar. Underestimating your performance can result in spending time and money fixing a model that is not broken. You do not want either of these.
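If you do decide to resample, the order of operations described in the comments (split first, then touch only the training set) can be sketched as follows. This is a minimal illustration using plain NumPy on made-up labels; the helper names `stratified_split` and `random_oversample` are hypothetical, not from any library, and random oversampling stands in for whatever resampling method you actually use.

```python
import numpy as np

def stratified_split(y, test_fraction, rng):
    """Return (train_idx, test_idx) with the same class rates in both parts."""
    train_idx, test_idx = [], []
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_test = int(round(test_fraction * len(idx)))
        test_idx.append(idx[:n_test])
        train_idx.append(idx[n_test:])
    return np.concatenate(train_idx), np.concatenate(test_idx)

def random_oversample(idx, y, rng):
    """Repeat minority-class indices (with replacement) until classes are balanced."""
    labels = y[idx]
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    balanced = []
    for c, n in zip(classes, counts):
        members = idx[labels == c]
        extra = rng.choice(members, size=target - n, replace=True)
        balanced.append(np.concatenate([members, extra]))
    return np.concatenate(balanced)

rng = np.random.default_rng(0)
y = np.array([0] * 950 + [1] * 50)  # made-up labels: 95% class 0, 5% class 1

# 1) Split first, so the test set keeps the real-world class rate.
train_idx, test_idx = stratified_split(y, test_fraction=0.2, rng=rng)

# 2) Resample the training indices only; the test indices are never touched.
train_bal_idx = random_oversample(train_idx, y, rng)

train_rate = y[train_idx].mean()          # class-1 rate before oversampling
train_bal_rate = y[train_bal_idx].mean()  # class-1 rate after oversampling
test_rate = y[test_idx].mean()            # class-1 rate in the untouched test set
print(train_rate, train_bal_rate, test_rate)
```

Stratification here applies to the split itself: the train and test sets start with the same class rate. The training set is allowed to differ afterwards because of resampling, while the test set stays representative of the data the model will see in production.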

A valid reason for fiddling with test data could be to check out what happens if there is some kind of data drift (checking robustness), but that does not seem to be what is happening here.

In this case, I don't think the AUROC would be affected by resampling the test set as it is the probability that a randomly selected positive pattern is ranked higher than a randomly selected negative pattern. Oversampling may be a solution to cost sensitive learning if you are using a classifier system that doesn't already have a good solution for that problem (which is the context in which SMOTE was originally proposed). I don't think I use any classifier systems though where that is the case today! — Dikran Marsupial, May 10 '23 at 17:54
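The point about AUROC can be checked with a small simulation. The sketch below assumes synthetic Gaussian scores (not the asker's data or model) and computes both metrics directly in NumPy: AUROC as the fraction of positive-negative pairs ranked correctly, and the area under the precision-recall curve as average precision. Undersampling the negatives in the test set should leave AUROC roughly unchanged, while the precision-based metric rises simply because the class prevalence changed.

```python
import numpy as np

def auroc(y_true, scores):
    """P(random positive is scored above random negative); ties count 0.5."""
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive."""
    order = np.argsort(-scores, kind="stable")
    y_sorted = y_true[order]
    tp = np.cumsum(y_sorted)
    precision = tp / np.arange(1, len(y_sorted) + 1)
    return float(np.sum(precision * y_sorted) / y_sorted.sum())

rng = np.random.default_rng(0)
n_neg, n_pos = 9500, 500  # assumed 95% / 5% test set
y = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
# Assumed model scores: positives score higher on average than negatives.
scores = np.concatenate([rng.normal(0.0, 1.0, n_neg), rng.normal(1.5, 1.0, n_pos)])

# Undersample the negatives so the test set becomes 1:1.
neg_keep = rng.choice(n_neg, size=n_pos, replace=False)
keep = np.concatenate([neg_keep, np.arange(n_neg, n_neg + n_pos)])
y_bal, scores_bal = y[keep], scores[keep]

auroc_full, auroc_bal = auroc(y, scores), auroc(y_bal, scores_bal)
ap_full, ap_bal = average_precision(y, scores), average_precision(y_bal, scores_bal)
print(f"AUROC: full={auroc_full:.3f}  balanced={auroc_bal:.3f}")
print(f"AP:    full={ap_full:.3f}  balanced={ap_bal:.3f}")
```

The same scores produce a much higher average precision on the balanced test set, which shows that the jump reflects the altered class rate rather than a better model.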
