I am facing the following issue. I fit a classification model on an imbalanced dataset with a binary target. The minority class is much less frequent than the majority class, so I oversampled the minority class to balance the dataset before fitting.
So far, so good: the model gives good results when predicting class labels.
Problem: I need not only the predicted class labels, but also the predicted class probabilities.
When I use the model for prediction, most predicted probabilities are polarized towards the two extremes (probabilities < 10% or > 90%). I guess this is caused by the fact that I fitted the model on an oversampled dataset with 50% majority class and 50% minority class, while in the original dataset the minority class has a frequency of about 5%.
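To make my guess concrete (this is just a back-of-the-envelope check with my numbers, assuming the resampling leaves the class-conditional distributions roughly unchanged): balancing the classes multiplies the minority-class odds by about $\frac{0.5/0.5}{0.05/0.95} \approx 19$, so a case whose true probability is 0.5 (odds 1:1) ends up predicted at odds of roughly 19:1, i.e. a probability of about 0.95. That seems consistent with the polarization I observe.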
Do you have any clue about how to treat this issue? Have you encountered this problem before?
Do you have an idea how to calibrate the probabilities of a model fitted on an oversampled dataset for a binary classification problem, taking into account the class frequencies of the original, unbalanced dataset?
Thanks in advance for your help, all the best,
Davide
EDIT:
I found a page giving a method for re-calibrating probabilities obtained via over/undersampling in classification:
Convert predicted probabilities after downsampling to actual probabilities in classification
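If I have understood it correctly, the idea boils down to correcting the predicted odds for the shift in class prior introduced by the resampling. Here is a minimal sketch of how I would apply that to my oversampling case (the function name is mine, and the 5% / 50% priors are the values from my dataset, so adjust as needed):

```python
import numpy as np

def adjust_probabilities(p_s, pi_orig=0.05, pi_train=0.5):
    """Map probabilities predicted by a model trained on a resampled
    (balanced) dataset back to the original class prior.

    p_s      : predicted minority-class probabilities from the balanced model
    pi_orig  : minority-class frequency in the original data (here ~5%)
    pi_train : minority-class frequency in the resampled training data (50%)
    """
    p_s = np.asarray(p_s, dtype=float)
    # Factor by which resampling inflated the minority-class odds
    r = (pi_train / (1 - pi_train)) / (pi_orig / (1 - pi_orig))
    # Deflate the odds by the same factor and convert back to a probability
    return p_s / (p_s + r * (1 - p_s))

# Example: a "0.95" from the balanced model maps back to roughly 0.5
print(adjust_probabilities([0.95, 0.5, 0.1]))
```

Does this look like a reasonable way to apply the correction, or would a calibration method fitted on the original (unbalanced) data be preferable?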