I am facing the following issue. I fit a classification model on an imbalanced dataset with a binary target. The minority class is much less frequent than the majority class, so I oversampled the minority class to balance the dataset before fitting.
So far, so good: the model gives good results when predicting class labels.
Problem: I need not only the predicted class labels, but also the predicted class probabilities.
When I use the model for prediction, most predicted probabilities are polarized towards the two extremes (probabilities < 10% or > 90%). I guess this is caused by the fact that I fitted the model on an oversampled dataset with 50% majority class and 50% minority class, while in the original dataset the minority class has a frequency of about 5%.
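To make my guess concrete (this is just a back-of-the-envelope check with my numbers, assuming the resampling leaves the class-conditional distributions roughly unchanged): balancing the classes multiplies the minority-class odds by about $\frac{0.5/0.5}{0.05/0.95} \approx 19$, so a case whose true probability is 0.5 (odds 1:1) ends up predicted at odds of roughly 19:1, i.e. a probability of about 0.95. That seems consistent with the polarization I observe.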
Do you have any clue about how to treat this issue? Have you encountered this problem before?
Do you have an idea how to calibrate the probabilities of a model fitted on an oversampled dataset for a binary classification problem, taking into account the class frequencies of the original, unbalanced dataset?
Thanks in advance for your help, all the best,
Davide
EDIT:
I found a page giving a method for re-calibrating probabilities obtained via over/undersampling in classification:
Convert predicted probabilities after downsampling to actual probabilities in classification
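If I have understood it correctly, the idea boils down to correcting the predicted odds for the shift in class prior introduced by the resampling. Here is a minimal sketch of how I would apply that to my oversampling case (the function name is mine, and the 5% / 50% priors are the values from my dataset, so adjust as needed):

```python
import numpy as np

def adjust_probabilities(p_s, pi_orig=0.05, pi_train=0.5):
    """Map probabilities predicted by a model trained on a resampled
    (balanced) dataset back to the original class prior.

    p_s      : predicted minority-class probabilities from the balanced model
    pi_orig  : minority-class frequency in the original data (here ~5%)
    pi_train : minority-class frequency in the resampled training data (50%)
    """
    p_s = np.asarray(p_s, dtype=float)
    # Factor by which resampling inflated the minority-class odds
    r = (pi_train / (1 - pi_train)) / (pi_orig / (1 - pi_orig))
    # Deflate the odds by the same factor and convert back to a probability
    return p_s / (p_s + r * (1 - p_s))

# Example: a "0.95" from the balanced model maps back to roughly 0.5
print(adjust_probabilities([0.95, 0.5, 0.1]))
```

Does this look like a reasonable way to apply the correction, or would a calibration method fitted on the original (unbalanced) data be preferable?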