
I have a dataset and I'm working on a binary classification task with it. It has a class imbalance problem: the ratio of the False class to the True class is about 10:1.

If I train a neural network on it directly, without tackling the imbalance problem, I get the following result:

test LogLoss 0.2025
test AUC 0.8578
              precision    recall  f1-score   support

       False       0.94      0.99      0.96   2294923
        True       0.84      0.30      0.44    224278

    accuracy                           0.93   2519201
   macro avg       0.89      0.65      0.70   2519201
weighted avg       0.93      0.93      0.92   2519201

After adding class weights to the training process, i.e., class_weight={0: 1., 1: 10.}, I retrained the model and got the following result:

test LogLoss 0.4166
test AUC 0.8646
              precision    recall  f1-score   support

       False       0.97      0.81      0.88   2294923
        True       0.28      0.74      0.41    224278

    accuracy                           0.81   2519201
   macro avg       0.62      0.78      0.65   2519201
weighted avg       0.91      0.81      0.84   2519201

It seems the log loss is worse but the AUC is better. The True class's precision is worse but recall is better.

  • How do you explain these changes in the metrics: why are some better and some worse?
  • Based on the result, should I use class weights in training?
  • They change differently because they are different measures; if they behaved the same, there would be no need for so many of them. Which model is better depends on your problem: how much you care about false positives/negatives, calibration, etc. That is a 'business' decision, not a machine learning decision. If you want to understand your model, plot the distributions of the unthresholded predictions for each group. You can also do this before and after adding the weights to see how the predictions change. – rep_ho Oct 23 '19 at 10:01
  • For the "True" instances (the minor class), after using class weight, its precision decreases from 0.84 to 0.28. I found this strange. The class weight should help the classifier better tell the differences between the two classes. Shouldn't all the precision metric increase? – CyberPlayerOne Oct 23 '19 at 10:40
  • 1
    that's caused by where you put your threshold. Since you include weights, now you are predicting more subjects as positive thus precision goes down and recall goes up. If you just plot two histograms for predictions of positive and negative classes and look where your threshold (0 or 0.5) is with respect to these distributions, everything will be much clearer to you – rep_ho Oct 25 '19 at 11:28

1 Answer


An interesting property of AUC is that it does not change unless you change the ordering of the points. For instance, if you divide every value by two, the AUC is the same.

library(pROC)
set.seed(2023)
N <- 1000
p <- rbeta(N, 1, 1)   # "true" probabilities, uniform on (0, 1)
y <- rbinom(N, 1, p)  # outcomes drawn from those probabilities
pROC::roc(y, p)$auc   # I get 0.8481
pROC::roc(y, p/2)$auc # Again, I get 0.8481: the ordering is unchanged

In this regard, the AUC does not consider calibration; AUC does not penalize the model for making predictions that are detached from reality, such as having events with predictions equal to $0.2$ that happen $50\%$ of the time. AUC is strictly a measure of ability to discriminate between categories.
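In the simulation above, halving the predictions leaves the AUC unchanged, but a calibration-sensitive metric catches it (a sketch continuing the code above; log_loss, discussed next, is a helper I define inline, not a pROC function):

log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
log_loss(y, p)     # calibrated predictions
log_loss(y, p / 2) # same ordering, but the log loss gets worse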

Log loss, however, considers both calibration and discrimination. The function penalizes predictions of category $1$ members for being away from $1$ and predictions of category $0$ members for being away from $0$, so it certainly covers discrimination. Calibration is harder to see from the equation, but by being a strictly proper scoring rule, it can be thought of as seeking out the true conditional probabilities of class membership (which we hope are extreme so we get good discrimination between categories, but we are not assured of that). Brier score, which is another strictly proper scoring rule, has an explicit decomposition into calibration and discrimination.
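As a sketch of that decomposition (using the simulated y and p from the code above; the 10 equal-width bins are an arbitrary choice of mine, and the binned version is only approximate for continuous forecasts):

# Murphy decomposition of the Brier score:
#   Brier ~ reliability - resolution + uncertainty
# Reliability measures (mis)calibration; resolution measures discrimination.
bins  <- cut(p, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
f_k   <- tapply(p, bins, mean)    # mean forecast in each bin
o_k   <- tapply(y, bins, mean)    # observed event rate in each bin
n_k   <- tapply(y, bins, length)  # bin sizes
o_bar <- mean(y)                  # overall event rate

reliability <- sum(n_k * (f_k - o_k)^2, na.rm = TRUE) / length(y)
resolution  <- sum(n_k * (o_k - o_bar)^2, na.rm = TRUE) / length(y)
uncertainty <- o_bar * (1 - o_bar)

reliability - resolution + uncertainty # approximately equal to...
mean((p - y)^2)                        # ...the Brier score itself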

What this result tells me is that you are harming your calibration without making much improvement to your discrimination, and I would consider this a net negative.

The reason you harm your calibration is that you do not penalize mistakes equally in your loss function. The weighted loss pushes the model toward especially high predicted probabilities of minority-class membership, so when you test on data whose true probability of minority-class membership is low, those probabilities are overestimated. Proponents of weighted loss functions consider this a feature, not a bug.

Your ability to discriminate between classes changes minimally because the goal of the weighted loss is simply to inflate the probability values so that more of the minority-class predictions are large. The analogy is not perfect, but it is as if you divided all of your predictions by the largest predicted probability: the order does not change, so the model's ability to discriminate between categories does not change, but every predicted value gets larger.
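A small simulation illustrates this (a sketch: I use weighted logistic regression as a stand-in for the weighted neural network loss, with the same 10:1 weights as in the question):

# Weighted vs. unweighted logistic regression on imbalanced data.
# The weights leave discrimination (AUC) untouched but inflate the
# predicted probabilities, which shows up as a worse log loss.
library(pROC)
set.seed(2023)
n <- 20000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-3 + x)) # rare positive class
w <- ifelse(y == 1, 10, 1)        # class_weight = {0: 1, 1: 10}

fit_plain    <- glm(y ~ x, family = binomial)
fit_weighted <- glm(y ~ x, family = binomial, weights = w)
p_plain      <- predict(fit_plain,    type = "response")
p_weighted   <- predict(fit_weighted, type = "response")

log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))
pROC::roc(y, p_plain)$auc    # identical AUCs: both models are
pROC::roc(y, p_weighted)$auc # monotone functions of x
log_loss(y, p_plain)         # the weighted model's log loss is worse...
log_loss(y, p_weighted)
mean(p_plain); mean(p_weighted); mean(y) # ...because its predictions are inflated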

Mostly, class imbalance is a non-problem for proper statistical methods, and attempts to "fix" class imbalance typically stem from using a threshold of $0.5$ and trying to force your predictions to fall on the correct side of that threshold, which seems to be how the weighted loss function is used here.

However, you do not have to use $0.5$ as a threshold. In fact, you do not have to use any threshold at all, and the raw predictions can be useful. This link gives a good discussion of why and links to other good material.
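If a hard decision is eventually required, the threshold can be derived from the consequences of the two error types instead of being baked into training (a sketch continuing the logistic-regression example above; the costs are invented for illustration):

# With calibrated probabilities, predicting True minimizes expected cost
# when p > c_fp / (c_fp + c_fn). The costs below are made up.
c_fp <- 1  # cost of a false positive
c_fn <- 10 # cost of a false negative
threshold <- c_fp / (c_fp + c_fn) # about 0.09, nowhere near 0.5
table(p_plain > threshold, y)     # decisions from the unweighted model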

Dave