I'm training a binary classification model for fraud detection, and my historical dataset is extremely imbalanced. I have tried training LightGBM and XGBoost models, and in both cases I've applied class weighting (class_weight='balanced' for LightGBM; XGBoost's sklearn wrapper exposes the equivalent scale_pos_weight instead).
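For context, here is a minimal sketch of the kind of setup I mean, with synthetic data standing in for my real features (the actual pipeline and hyperparameters are omitted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

# Synthetic stand-in for my data: roughly 1% positives (fraud).
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.99, 0.01], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# LightGBM's sklearn wrapper accepts class_weight directly.
lgbm = LGBMClassifier(class_weight='balanced').fit(X_train, y_train)

# XGBoost's sklearn wrapper uses scale_pos_weight instead;
# n_negative / n_positive approximates 'balanced'.
spw = (y_train == 0).sum() / (y_train == 1).sum()
xgb = XGBClassifier(scale_pos_weight=spw).fit(X_train, y_train)

# These scores pile up near 0 and near 1.
scores = lgbm.predict_proba(X_test)[:, 1]
```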
When I run predict_proba, it returns only extreme scores, i.e., values close to 0 or close to 1. I suspect this is caused by the imbalanced dataset. I'm performing threshold moving to find the "sweet spot", though.
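This is roughly how I'm doing the threshold moving, continuing from the snippet above (F1 here is just a placeholder for whatever business metric actually applies):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Threshold moving: sweep candidate thresholds and pick the one
# that maximizes F1 on a held-out set.
precision, recall, thresholds = precision_recall_curve(y_test, scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = thresholds[np.argmax(f1[:-1])]  # last P/R pair has no threshold
print(f"best threshold: {best:.3f}, F1: {f1[:-1].max():.3f}")
```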
But is there any real problem with the scores behaving like that? Should I fix it or try changing the hyperparameters? I know I can perform probability calibration so that the scores are more "reliable". Aside from that, should I be concerned about the distribution of the probability scores?
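For completeness, the calibration I have in mind would look something like this, again continuing from the snippets above (whether isotonic or sigmoid is better here is an open question for me):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Wrap the weighted model and fit calibrators via cross-validation,
# then compare calibration quality on the held-out set.
calibrated = CalibratedClassifierCV(
    LGBMClassifier(class_weight='balanced'), method='isotonic', cv=5
).fit(X_train, y_train)
cal_scores = calibrated.predict_proba(X_test)[:, 1]

print("Brier before:", brier_score_loss(y_test, scores))
print("Brier after: ", brier_score_loss(y_test, cal_scores))
```

One caveat I'm aware of: as far as I understand, class weighting deliberately shifts the scores away from the observed class frequencies, so calibration would partly be undoing the effect of the weights.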
class_weight='balanced' I guess just weights the loss function, like in logistic regression. – Gabriel Monteiro Oct 19 '22 at 12:08