2

I have two models for which I calculate train and test performances. Both use the same algorithm (lightgbm) with the same hyper-params; only the data differ (the second one is trained on the data from the first one plus some more).

The first model returns 0.73 AUC with 0.32 Precision and 0.44 Recall (all these on the test set). The second model returns a considerably lower AUC (0.65) but with higher Precision (0.35) and Recall (0.66). Isn't that a paradox? Oh and if that matters, I should mention that the problem is imbalanced (10% class 1).

  • How big are the two training sets? – Demetri Pananos Oct 07 '21 at 15:18
  • 2
    It may be that the high-AUC model outperforms the low-AUC model at the threshold you’re using (probably $0.5$), but check other thresholds. // Performance metrics do not have to agree; that’s why we have multiple metrics. // The best metrics tend to be so-called strictly proper [tag:scoring-rules] like log loss (cross-entropy loss) and Brier score, both of which perform fine when the classes are imbalanced. Frank Harrell has written about this on his blog. // Log loss and Brier score, both of which are strictly proper scoring rules, won’t even always agree! – Dave Oct 07 '21 at 15:20
  • Is the second dataset less imbalanced than the first? Also, you should be using area under the precision recall curve instead of just precision and recall at whatever threshold the default is. – tkunk Oct 07 '21 at 22:07
  • @DemetriPananos 2500 observations the first one, 4000 the second one – Georgios Sarantitis Oct 08 '21 at 08:47

2 Answers

9

Remember that ROC curves are constructed by considering all thresholds, while metrics like accuracy, sensitivity, specificity, precision, and recall use only one. If you calculate the precision and recall of both models across a range of thresholds instead of at a single one, I would expect you to find that the high-AUC model tends to outperform the low-AUC model.
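To see this concretely, you can sweep the threshold yourself. A minimal sketch with synthetic scores (the 10% positive rate mimics the question; everything else is invented and stands in for the real lightgbm predictions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a test set: 10% positives, as in the question,
# with made-up scores in place of the actual model output.
y = rng.random(4000) < 0.10
scores = np.clip(rng.normal(0.2 + 0.3 * y, 0.15), 0.001, 0.999)

# Precision and recall at several thresholds; 0.5 is just one choice.
results = {}
for t in [0.3, 0.4, 0.5, 0.6]:
    pred = scores >= t
    tp = (pred & y).sum()
    results[t] = (tp / max(pred.sum(), 1), tp / y.sum())
    print(f"threshold={t:.1f}  precision={results[t][0]:.2f}  recall={results[t][1]:.2f}")
```

Raising the threshold trades recall away, so comparing two models at a single threshold can easily reverse the ordering that the full ROC curve gives.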

However, it is usually preferable to evaluate the probability predictions themselves, rather than thresholded classifications. Two common ways of doing this are log loss ("cross-entropy loss" in many neural network circles) and Brier score. Frank Harrell has two good blog posts about this topic.

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

Classification vs. Prediction

Stephan Kolassa wrote a nice answer to a question of mine that gets at this topic, too.

Academic reference on the drawbacks of accuracy, F1 score, sensitivity and/or specificity

Note that strictly proper scoring rules like log loss and Brier score need not agree about which model performs better (fairly easy to simulate), so it should not be expected that AUC and precision or AUC and recall agree on the better model, either.
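Such a disagreement is indeed easy to simulate. A hand-picked toy case (all numbers invented): model A is confidently right nine times out of ten and confidently wrong once, while model B hedges at 0.6 everywhere; log loss prefers B, Brier score prefers A.

```python
import numpy as np

y = np.ones(10)  # ten cases, all truly class 1 (toy setup)

p_a = np.array([0.999] * 9 + [0.001])  # model A: very confident, wrong once
p_b = np.full(10, 0.6)                 # model B: modest everywhere

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def brier(y, p):
    return np.mean((p - y) ** 2)

# Log loss punishes the single confident mistake hard: A ≈ 0.69 vs B ≈ 0.51.
print(log_loss(y, p_a), log_loss(y, p_b))
# Brier is bounded per observation, so A's many near-perfect calls win: A ≈ 0.10 vs B = 0.16.
print(brier(y, p_a), brier(y, p_b))
```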

Dave
  • 62,186
0

Here's how it may happen: the AUC-ROC calculation is based on Sensitivity and Specificity, both of which are based on the correctly predicted values for both the Positive and the Negative class:

Sensitivity = True Positive Rate = TPos / (TPos + FNeg)

Specificity = True Negative Rate = TNeg / (TNeg + FPos)

Precision and Recall, on the other hand, are based on the True Positive values:

Precision = Positive Predictive Value = TPos / (TPos + FPos)

Recall (same as Sensitivity) = True Positive Rate = TPos / (TPos + FNeg)

Note that the AUC-ROC metric takes all 4 values into consideration (TPos, TNeg, FPos and FNeg). For Precision and Recall, however, only 3 of those values enter the calculation (TPos, FPos and FNeg), while the count of True Negatives is disregarded.
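This asymmetry is easy to check numerically. Two made-up confusion matrices below share TPos, FPos and FNeg but differ wildly in TNeg; Precision and Sensitivity come out identical, while Specificity does not:

```python
# Two made-up confusion matrices: identical TP, FP, FN, very different TN
# (as would happen if one test set simply contained many more negatives).
cases = {"many TN": dict(tp=40, fp=75, fn=60, tn=900),
         "few TN":  dict(tp=40, fp=75, fn=60, tn=200)}

metrics = {}
for name, c in cases.items():
    metrics[name] = {
        "sensitivity": c["tp"] / (c["tp"] + c["fn"]),  # recall / TPR: no TN term
        "specificity": c["tn"] / (c["tn"] + c["fp"]),  # TNR: uses TN
        "precision":   c["tp"] / (c["tp"] + c["fp"]),  # PPV: no TN term
    }
    print(name, metrics[name])
```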

This means that your second model may have slightly improved its True Positive detection rate while its True Negative detection rate decreased to a much greater extent. That would yield better Precision and Recall, as these metrics disregard TNeg, but would hurt the overall AUC-ROC value, because the gain in TPos detection is outweighed by the larger loss in TNeg detection.
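A hand-built toy example (all scores invented; scikit-learn assumed available) reproduces the pattern from the question: the second score set ranks worse overall, giving a lower AUC, yet at the 0.5 threshold it has both higher precision and higher recall.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, precision_score, recall_score

# 10 positives, 90 negatives (10% class 1, as in the question); the score
# values are chosen by hand purely to reproduce the pattern.
y = np.array([1] * 10 + [0] * 90)
scores = {
    # "model 1": near-perfect ranking, but most positives sit just below 0.5
    "model 1": np.array([0.6] * 4 + [0.45] * 6 + [0.55] * 12 + [0.3] * 78),
    # "model 2": worse ranking (10 negatives outrank every positive),
    # yet the 0.5 threshold happens to catch most positives
    "model 2": np.array([0.7] * 8 + [0.1] * 2 + [0.8] * 10 + [0.2] * 80),
}

out = {}
for name, s in scores.items():
    pred = (s >= 0.5).astype(int)
    out[name] = (roc_auc_score(y, s), precision_score(y, pred), recall_score(y, pred))
    # model 1: AUC 0.92, precision 0.25, recall 0.40
    # model 2: AUC 0.71, precision 0.44, recall 0.80
    print(name, "AUC=%.2f precision=%.2f recall=%.2f" % out[name])
```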

The fact that your dataset is imbalanced may amplify this effect even further.