I have a BalancedRandomForest model which was trained on imbalanced data (a 92/8 class split) for a binary classification problem.
The AUC is around 0.98, and the precision and recall are also acceptable, at 0.89 and 0.95 respectively. Based on these metrics, I would consider this model fairly well trained and expect it to behave reasonably well on unseen data.
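For context, here is a minimal sketch of the kind of setup I mean, on synthetic data with the same 92/8 imbalance. Note the assumptions: I'm using sklearn's `RandomForestClassifier` with `class_weight="balanced_subsample"` as a stand-in for a balanced random forest (imblearn's `BalancedRandomForestClassifier` would be a drop-in alternative), and `make_classification` in place of the real dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: ~92/8 class imbalance.
X, y = make_classification(n_samples=5000, weights=[0.92, 0.08], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced_subsample" approximates a balanced random forest;
# it re-weights classes within each bootstrap sample.
clf = RandomForestClassifier(class_weight="balanced_subsample", random_state=0)
clf.fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)[:, 1]  # probability of the positive class
pred = (proba >= 0.5).astype(int)      # default 0.5 threshold

print("AUC:      ", roc_auc_score(y_te, proba))
print("Precision:", precision_score(y_te, pred))
print("Recall:   ", recall_score(y_te, pred))
```

The exact numbers will differ from my real results, of course; this is just to make the evaluation reproducible in spirit.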
However, when I looked at the predicted class probabilities, I saw that they concentrate at the low and high ends of the distribution, meaning the model has almost no doubt when separating the negative and positive classes:
Here's the same plot with the true class shown in red:
And a zoom on the higher probabilities:
This shows a close match between predictions and true classes, aside from a small number of false positives in the last bar.
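To quantify what "concentrated at the extremes" means, one can measure how much of the probability mass falls into the outermost bins. A small sketch, using simulated Beta-distributed scores as a hypothetical stand-in for the real model output (the shape parameters are assumptions, chosen just to produce a U-shaped histogram like the one in my plots):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated scores: negatives pushed toward 0, positives toward 1,
# with the same 92/8 split as the real data.
proba = np.concatenate([
    rng.beta(0.5, 8, 920),   # hypothetical negative-class scores
    rng.beta(8, 0.5, 80),    # hypothetical positive-class scores
])

# A "confident" U-shaped histogram puts most of its mass
# below 0.1 or above 0.9.
extreme = np.mean((proba < 0.1) | (proba > 0.9))
print(f"fraction of scores in the extreme bins: {extreme:.2f}")
```

If a number like this is very high on held-out data but the model is still well calibrated there, extreme confidence by itself isn't evidence of leakage.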
My question is: is this distribution normal? Or is my model suffering from some kind of data leakage? I've already checked the most important features, and none of them look suspicious of leaking information.
This is especially important because in previous iterations of the model (AUC 0.95), I typically tuned the probability threshold that separates the classes upward (to around 0.65-0.75) to reduce the false positive count. With this version, though, I can't tune it to the same values, and I'm afraid there's some negative effect happening that I'm simply not aware of.
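For completeness, here is roughly how I approach the threshold tuning, sketched with simulated scores (the Beta distributions and the 0.90 precision target are assumptions for illustration, not my real data or target): pick the lowest threshold on the precision-recall curve that meets a precision floor, rather than hand-picking a fixed value like 0.65-0.75.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Hypothetical labels and scores with a 92/8 split.
y_true = np.concatenate([np.zeros(920), np.ones(80)]).astype(int)
scores = np.concatenate([rng.beta(1, 6, 920), rng.beta(6, 1, 80)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Lowest threshold whose precision meets the target; precision/recall
# have one more entry than thresholds, hence the [:-1] slice.
target = 0.90
ok = np.where(precision[:-1] >= target)[0]
best = thresholds[ok[0]] if len(ok) else None
print("threshold for precision >=", target, ":", best)
```

On a very confident model, the curve is nearly flat over a wide range of thresholds, which would explain why moving the threshold to 0.65-0.75 barely changes the false positive count.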
Any ideas are welcome, and if you need more info, please let me know.
Thanks in advance!


