I have a BalancedRandomForest model which was trained on imbalanced data (a 92/8 class split) for a binary classification problem.
The AUC is around 0.98, and the precision and recall are also acceptable, at 0.89 and 0.95 respectively. Based on these metrics, I would consider this model fairly well trained and expect it to behave reasonably well on unseen data.
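For context, here is a minimal sketch of the kind of setup I mean, on synthetic data with the same 92/8 imbalance. Note the assumptions: I'm using sklearn's `RandomForestClassifier` with `class_weight="balanced_subsample"` as a stand-in for a balanced random forest (imblearn's `BalancedRandomForestClassifier` would be a drop-in alternative), and `make_classification` in place of the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: ~92/8 class imbalance.
X, y = make_classification(n_samples=5000, weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced_subsample" approximates a balanced random forest;
# it re-weights classes within each bootstrap sample.
clf = RandomForestClassifier(class_weight="balanced_subsample", random_state=0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
pred = (proba >= 0.5).astype(int)      # default 0.5 threshold

print("AUC:      ", roc_auc_score(y_te, proba))
print("Precision:", precision_score(y_te, pred))
print("Recall:   ", recall_score(y_te, pred))
```

The exact numbers will differ from my real results, of course; this is just to make the evaluation reproducible in spirit.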
However, when I looked at the predicted class probabilities, I saw that they concentrate at the low and high ends of the distribution, meaning the model has almost no doubt when separating the negative and positive classes:
Here's the same plot with the true class shown in red:
And a zoom on the higher probabilities:
This shows a close match between predictions and true classes, aside from a small number of false positives in the last bar.
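To quantify what "concentrated at the extremes" means, one can measure how much of the probability mass falls into the outermost bins. A small sketch, using simulated Beta-distributed scores as a hypothetical stand-in for the real model output (the shape parameters are assumptions, chosen just to produce a U-shaped histogram like the one in my plots):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated scores: negatives pushed toward 0, positives toward 1,
# with the same 92/8 split as the real data.
proba = np.concatenate([
    rng.beta(0.5, 8, 920),   # hypothetical negative-class scores
    rng.beta(8, 0.5, 80),    # hypothetical positive-class scores
])

# A "confident" U-shaped histogram puts most of its mass
# below 0.1 or above 0.9.
extreme = np.mean((proba < 0.1) | (proba > 0.9))
print(f"fraction of scores in the extreme bins: {extreme:.2f}")
```

If a number like this is very high on held-out data but the model is still well calibrated there, extreme confidence by itself isn't evidence of leakage.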
My question is: is this distribution normal? Or is my model suffering from some kind of data leakage? I've already checked the most important features, and none of them look suspicious of leaking information.
This is especially important because in previous iterations of the model (AUC 0.95), I typically tuned the probability threshold that separates the classes upward (to around 0.65-0.75) to reduce the false positive count. With this version, though, I can't tune it to the same values, and I'm afraid there's some negative effect happening that I'm simply not aware of.
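For completeness, here is roughly how I approach the threshold tuning, sketched with simulated scores (the Beta distributions and the 0.90 precision target are assumptions for illustration, not my real data or target): pick the lowest threshold on the precision-recall curve that meets a precision floor, rather than hand-picking a fixed value like 0.65-0.75.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Hypothetical labels and scores with a 92/8 split.
y_true = np.concatenate([np.zeros(920), np.ones(80)]).astype(int)
scores = np.concatenate([rng.beta(1, 6, 920), rng.beta(6, 1, 80)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Lowest threshold whose precision meets the target; precision/recall
# have one more entry than thresholds, hence the [:-1] slice.
target = 0.90
ok = np.where(precision[:-1] >= target)[0]
best = thresholds[ok[0]] if len(ok) else None
print("threshold for precision >=", target, ":", best)
```

On a very confident model, the curve is nearly flat over a wide range of thresholds, which would explain why moving the threshold to 0.65-0.75 barely changes the false positive count.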
Any ideas are welcome, and if you need more info, please let me know.
Thanks in advance!


