My understanding of oversampling/undersampling is that it merely manipulates the data to increase the proportion of the rare group in your target. Isn't this manipulating the data? How can you expect a reliable outcome from oversampling/undersampling? For example, you fit a logistic regression on oversampled/undersampled data and use that model to assign a score/probability to the original data. For the same predictor values, the scores will differ from those of a logistic regression developed on the original data, so you need to adjust the scores from the model built on the oversampled/undersampled data. But why do you need to adjust your probability scores when the whole purpose of oversampling/undersampling is to increase the probability of the rare class? Doesn't adjusting the probabilities/scores revert everything back to the original state? Can someone explain in simple terms?
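To make this concrete, here is a rough sketch of the score shift I am describing (scikit-learn on made-up data; the exact settings are arbitrary):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# imbalanced toy data: roughly 5% of observations in the rare class
X, y = make_classification(n_samples=10_000, weights=[0.95, 0.05], random_state=0)

# model fit on the original data
clf_orig = LogisticRegression(max_iter=1000).fit(X, y)

# naive oversampling: duplicate the rare-class rows until the classes are balanced
pos, neg = X[y == 1], X[y == 0]
reps = len(neg) // len(pos)
X_over = np.vstack([neg, np.repeat(pos, reps, axis=0)])
y_over = np.concatenate([np.zeros(len(neg)), np.ones(reps * len(pos))])
clf_over = LogisticRegression(max_iter=1000).fit(X_over, y_over)

# same observations, noticeably different scores
print(clf_orig.predict_proba(X[:5])[:, 1].round(3))
print(clf_over.predict_proba(X[:5])[:, 1].round(3))
```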
-
Good news! Class imbalance is not a problem! https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he https://www.fharrell.com/post/class-damage/ https://www.fharrell.com/post/classification/ https://stats.stackexchange.com/a/359936/247274 https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email https://twitter.com/f2harrell/status/1062424969366462473?lang=en – Dave Jul 07 '21 at 01:10
-
It is perhaps worthwhile to ask when imbalanced data is a problem, and whether that problem is one we are required to solve. https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning – Sycorax Jul 07 '21 at 01:54
-
It still did not really explain it… – gyambqt Jul 07 '21 at 08:30
-
Can you elaborate? I think the linked threads are very clear. What part of these answers is unclear, specifically? – Sycorax Jul 07 '21 at 16:18
-
Could you refer to this: https://stats.stackexchange.com/questions/533678/oversampling-undersampling-issue – gyambqt Jul 08 '21 at 11:16
1 Answer
A typical use of oversampling, or other artificial balancing of the categories, is to give the minority category a better chance of receiving a prediction above the threshold used to turn continuous model predictions into discrete categorical predictions. However, when the categories are imbalanced, it may be that the majority category really is always the more likely one. Consequently, to get predictions that are aligned with how frequently the categories actually occur, those artificially inflated predictions have to be toned down.
So the strategy is:
1. Artificially inflate the probability of membership in the minority category so the predictions are more likely to be above the threshold.
2. Calibrate these inflated predictions so the final predictions of the pipeline are related to the true probabilities of event occurrence. That is, we do not want a predicted probability of $0.6$ to correspond to the event happening $20\%$ of the time, as this would mean that the predicted probability is not telling the truth.
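A minimal sketch of these two steps (scikit-learn on made-up data; balancing by duplicating minority rows and using the analytic prior-shift correction are just one way to carry out each step):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# imbalanced toy data: roughly 5% of observations in the minority category
X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=0)
pi_true = y.mean()                 # prevalence in the real data

# step 1: artificially balance the classes by duplicating minority rows, then fit
pos, neg = X[y == 1], X[y == 0]
reps = len(neg) // len(pos)
X_bal = np.vstack([neg, np.repeat(pos, reps, axis=0)])
y_bal = np.concatenate([np.zeros(len(neg)), np.ones(reps * len(pos))])
pi_bal = y_bal.mean()              # prevalence the model actually saw (about 0.5)
model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)

# step 2: calibrate the inflated predictions back to the true base rate
# (prior-shift correction: rescale the odds by the ratio of the two prevalences)
def recalibrate(p, pi_seen, pi_real):
    num = p * pi_real / pi_seen
    den = num + (1 - p) * (1 - pi_real) / (1 - pi_seen)
    return num / den

p_inflated = model.predict_proba(X)[:, 1]
p_calibrated = recalibrate(p_inflated, pi_bal, pi_true)
print(p_inflated.mean().round(3), p_calibrated.mean().round(3))
```

After the correction, the average predicted probability is back in line with the true prevalence, which is exactly why the whole detour can feel like reverting to where you started.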
At best, this strikes me as inefficient.$^{\dagger}$ At worst, it misleads aspiring machine learning modelers into deemphasizing the rich information available in the probability predictions and obsessing over a threshold of $0.5$ just because that is the software default. Even though the full probability predictions carry considerable information, if you must use a threshold (such as in an automated software system that either does or does not ring an alarm), you can at the very least change it to something more reasonable for the task than $0.5$.
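For instance, a sketch of simply moving the threshold instead of resampling (the cost-motivated cutoff of $0.1$ is made up for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# imbalanced toy data, model fit on the data exactly as it comes
X, y = make_classification(n_samples=20_000, weights=[0.95, 0.05], random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)
p = model.predict_proba(X)[:, 1]   # honest probability predictions

# software-default behaviour: ring the alarm when p >= 0.5
flag_default = p >= 0.5
# task-appropriate behaviour: ring the alarm at a cutoff chosen from the costs
flag_custom = p >= 0.1

print("flagged at 0.5:", int(flag_default.sum()))
print("flagged at 0.1:", int(flag_custom.sum()))
print("actual events: ", int(y.sum()))
```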
$^{\dagger}$There are interesting edge cases where such an approach of oversampling and then adjusting the outputs can be a good idea. There is a nice example in the comments related to computational efficiency and another one linked here (by the same member as the comment).
-
This strategy can be more efficient if used appropriately. Say you are looking at a screening test for a disease with an operational occurrence of one in 100,000. In that case it would probably be a waste of CPU time to build a model with 100,000 negative examples for each positive example. The value of data is often a matter of diminishing returns and the characterisation of the distribution of the negative class will not benefit greatly from having so many examples. In that case, sub-sampling the majority class and adjusting the output to compensate for the disparity between ... – Dikran Marsupial Apr 10 '23 at 13:04
-
calibration and operational class frequencies is a perfectly reasonable thing to do, and computationally more efficient. Note also that data can be expensive to collect, store and process, which is another way in which this strategy can be more efficient. For modern machine learning methods (like logistic regression ;o) there is rarely a statistical need to up- or down-sample, but there may be computational reasons. – Dikran Marsupial Apr 10 '23 at 13:06
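A minimal sketch of the subsample-then-compensate approach described in these comments (made-up rare-event data; the intercept offset of $\log(1/\text{keep fraction})$ is the usual case-control correction, applied only approximately here since regularisation is ignored):

```python
import numpy as np
from scipy.special import expit  # inverse logit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# rare-event toy data: about 1% positives
X, y = make_classification(n_samples=200_000, weights=[0.99, 0.01], random_state=2)

# keep every positive example but only a small fraction of the negatives
rng = np.random.default_rng(0)
keep_frac = 0.05
keep = (y == 1) | (rng.random(len(y)) < keep_frac)
model = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])

# compensate for the subsampling: the fitted intercept is inflated by
# log(sampling rate of positives / sampling rate of negatives) = log(1 / keep_frac),
# so shift the predicted log-odds back down before converting to probabilities
log_odds = model.decision_function(X)
p_corrected = expit(log_odds - np.log(1.0 / keep_frac))

print("true prevalence:          ", round(float(y.mean()), 4))
print("mean corrected prediction:", round(float(p_corrected.mean()), 4))
```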