I am running a logistic regression model on heavily upsampled data (upsampled by repeating certain observations). Within certain cuts of the training data (groups used to slice the sample), the mean predicted probability of the majority class is not equal or even close to the actual proportion of the majority class in the same cut. Under what scenario does that happen? And is it important for the two to be close to one another?
This is why statisticians oppose upsampling: your predicted probabilities wind up skewed too high, because you’re telling the model not to be so skeptical about minority-class membership. – Dave Dec 07 '22 at 02:16
1 Answer
When you upsample, you are telling the model that the minority class is more common than it really is. Consequently, you should not be surprised that the model is insufficiently skeptical about membership in the minority class. Further, by making minority-class membership more likely, you make majority-class membership less likely, so you should expect the predictions to shift for every observation, not just for members of the minority class.
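To make the direction of the shift concrete, here is a small worked equation. It assumes something the question does not state: that every minority-class observation is repeated the same number of times, $k$. Writing $\pi(x)$ for the minority-class probability at covariates $x$ in the original data and $\pi_s(x)$ for the same quantity in the upsampled data,

$$
\pi_s(x) = \frac{k\,\pi(x)}{k\,\pi(x) + 1 - \pi(x)}
\quad\Longleftrightarrow\quad
\frac{\pi_s(x)}{1-\pi_s(x)} = k\,\frac{\pi(x)}{1-\pi(x)}.
$$

Under that assumption the minority-class odds are multiplied by $k$ everywhere, so the majority-class probability $1-\pi_s(x)$ falls below $1-\pi(x)$ in every cut of the data. If different observations were repeated different numbers of times, the shift would vary from cut to cut instead of being a single constant factor.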
While class imbalance is minimally problematic in most situations and there is no need to use upsampling to fix a non-problem, the good news is that you can calibrate your model to account for having altered the prior probability (class ratio).
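One simple way to do that calibration is a prior (odds) correction. The following is a minimal sketch, not a definitive implementation: it assumes every minority-class row was repeated exactly `k` times and that the model is otherwise well specified. The function name `correct_for_upsampling` and the parameter `k` are illustrative, not from any library.

```python
import numpy as np

def correct_for_upsampling(p_minority_upsampled, k):
    """Map minority-class probabilities from a model fit on upsampled data
    back to the original class ratio.

    Assumption: each minority-class observation was repeated k times, so the
    fitted minority-class odds are k times the odds in the original data.
    """
    p = np.asarray(p_minority_upsampled, dtype=float)
    # Divide the odds p / (1 - p) by k, then convert back to a probability.
    return p / (p + k * (1.0 - p))

# Example: predictions from a model trained with the minority class repeated 9x.
p_upsampled = np.array([0.5, 0.9, 0.1])
p_corrected = correct_for_upsampling(p_upsampled, k=9)
print(p_corrected)
```

For logistic regression this is equivalent to subtracting `log(k)` from the fitted intercept. After the correction, the mean predicted probability within a cut can be compared against the actual proportion in the original (not upsampled) data for that cut.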
Dave