
I am facing the following issue. I fit a classification model on an unbalanced dataset with a binary target. The minority class is much less frequent than the majority class, so I used oversampling to balance the classes in the dataset before fitting.

Fair enough: the model gives good results when predicting the class labels.

Problem: I need not only the predicted class labels, but also the predicted class probabilities.

When I use the model for predictions, most of the predicted probabilities are polarized towards the two extremes (probabilities < 10 % or > 90 %). I guess this is caused by the fact that I fitted the model on an oversampled dataset with 50 % majority class and 50 % minority class, while in the original dataset the minority class has a frequency of about 5 %.
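To make the suspected mechanism concrete (under the assumption that oversampling only changes the class prior and leaves the class-conditional distributions untouched): Bayes' rule says the posterior odds get multiplied by the ratio of the prior odds, here roughly (0.5 / 0.5) / (0.05 / 0.95) = 19. So a case whose true probability is about 32 % would be reported as about 90 % by the model trained on the 50/50 data.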

Do you have any clue about how to treat this issue? Have you run into this problem before?

Do you have an idea of how to rebalance/calibrate the probabilities of a model fitted on an oversampled dataset for a binary classification problem, taking into account the class distribution of the original unbalanced dataset?

Thanks in advance for your help, all the best,

Davide

EDIT:

I found a page giving a method for re-calibrating probabilities obtained via over/undersampling in classification:

Convert predicted probabilities after downsampling to actual probabilities in classification
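For reference, one common way to do this kind of re-calibration is to rescale the predicted odds by the ratio of the original prior odds to the prior odds of the resampled training set. Below is a minimal sketch of that correction; the function name and parameters are just illustrative, and it assumes that oversampling only changed the class prior (here from about 5 % to 50/50):

```python
import numpy as np

def correct_probabilities(p_resampled, prior_original, prior_resampled=0.5):
    """
    Adjust positive-class probabilities predicted by a model trained on a
    resampled (e.g. 50/50 oversampled) dataset back to the original class prior.

    Assumes resampling only changed the class prior, i.e. the class-conditional
    distributions P(x | y) are the same in both datasets.
    """
    p = np.asarray(p_resampled, dtype=float)
    # Odds predicted under the resampled (balanced) prior
    odds = p / (1.0 - p)
    # Rescale the odds by the ratio of prior odds: original / resampled
    odds_corrected = odds * (prior_original / (1.0 - prior_original)) \
                          / (prior_resampled / (1.0 - prior_resampled))
    # Convert back to probabilities
    return odds_corrected / (1.0 + odds_corrected)

# Example: a prediction of 0.90 on the 50/50 training distribution
# corresponds to roughly 0.32 when the true minority prevalence is 5 %.
print(correct_probabilities([0.90, 0.99], prior_original=0.05))
```

An alternative would be to refit a calibrator (e.g. Platt scaling or isotonic regression) on a held-out set drawn from the original, unbalanced distribution.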
