
Since I first heard about proper scoring rules for binary classification, such as the Brier score or log loss, I have become more and more convinced that they are drastically underrepresented in practice in favor of measures like accuracy, ROC AUC, or F1. As I want to drive a shift towards proper scoring rules for model comparison in my organization, there is one common argument that I cannot fully answer:

If there is extreme class imbalance (e.g., 5 positive cases vs. 1,000 negative cases), how does the Brier score ensure that we select the model that gives us the best performance in terms of high predicted probabilities for the 5 positive cases? After all, we do not care whether the negative cases have predictions near 0 or 0.5, as long as they are relatively lower than those for the positive cases.

I have two possible answers available right now but would love to hear expert opinions on this topic:

1."The Brier score as a proper scoring rule gives rare events the appropriate weight that they should have on the performance evaluation. Discriminative power can further be examined with ROC AUC."

This follows the logic of Frank Harrell's comment on a related question: "Forecasts of rare events have the "right" effect on the mean, i.e., mean predicted probability of the event = overall proportion of events. The Brier score works no matter what the prevalence of events." As he further suggests there, one could supplement the Brier score with ROC AUC to examine the extent to which the desired relative ranking of positive over negative cases was achieved.

2."We can use stratified Brier score to equally weight the forecast performance regarding each class."

This follows this paper's argumentation: "Averaging the Brier score of all the classes gives the stratified Brier score. The stratified Brier score is more appropriate when there is class imbalance since it gives equal importance to all the classes and thus allows any miscalibration of the minority classes to be spotted." I am not sure whether the loss of the strictly proper scoring rule property is worth the heavier weighting of the minority class of interest, and whether there is a statistically sound foundation for this somewhat arbitrary way of reweighting ("If we follow this approach, what stops us from going further and weighting the minority class 2, 17, or 100 times as much as the other class?").
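
To make the trade-off concrete, here is a small numerical sketch (toy numbers mimicking the 5-vs-1,000 split and hypothetical constant predictors rather than real models; the helper functions are my own, not from any library):

```python
import numpy as np

# Toy data mimicking the 5-positive / 1,000-negative imbalance
y = np.concatenate([np.ones(5), np.zeros(1000)])

def brier(y_true, p):
    return np.mean((y_true - p) ** 2)

def stratified_brier(y_true, p):
    # Average of the per-class Brier scores, as in the quoted paper
    return 0.5 * (brier(y_true[y_true == 1], p[y_true == 1])
                  + brier(y_true[y_true == 0], p[y_true == 0]))

# Three constant "models": predict the prevalence, predict 0, predict 0.5
for label, q in [("prevalence", y.mean()), ("always 0", 0.0), ("always 0.5", 0.5)]:
    p = np.full_like(y, q)
    print(f"{label:>10}: Brier={brier(y, p):.4f}, "
          f"stratified={stratified_brier(y, p):.4f}, mean prediction={p.mean():.4f}")

# The plain Brier score is minimized by the prevalence (Harrell's calibration point),
# while the stratified version prefers 0.5, i.e., it reweights the minority class.
```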

stat2739
    Of possible interest: https://stats.stackexchange.com/questions/464636/proper-scoring-rule-when-there-is-a-decision-to-make-e-g-spam-vs-ham-email – Dave Sep 25 '20 at 10:18
  • With so much emphasis on this site about the superiority of strictly proper scoring rules, the fact that there is no one-size-fits-all strictly proper scoring rule sometimes gets missed. The Brier score might not be the best choice if rare events are involved. I made something of a crack at that issue here, and there's some discussion here, but I don't think that either provides the answer to this question. – EdM Sep 25 '20 at 15:57
  • Thank you both for these links! They provide very valuable insights, but yes, do not directly answer my question. Hoping for further answers to clarify the issue a little bit more. – stat2739 Sep 30 '20 at 11:22

3 Answers


If there is extreme class imbalance (e.g., 5 positive cases vs. 1,000 negative cases), how does the Brier score ensure that we select the model that gives us the best performance in terms of high predicted probabilities for the 5 positive cases? After all, we do not care whether the negative cases have predictions near 0 or 0.5, as long as they are relatively lower than those for the positive cases.

This depends crucially on whether we can separate subpopulations with different class probabilities based on predictors. As an extreme example, if there are no (or no useful) predictors, then predicted probabilities for all instances will be equal, and requiring lower predictions for negative vs. positive classes makes no sense, whether we are looking at Brier scores or other loss functions.

Yes, this is rather obvious. But we need to keep it in mind.

So let's look at the second-simplest case. Assume we have a predictor that separates our population cleanly into two subpopulations. Among subpopulation 1, there are 4 positive and 200 negative cases. Among subpopulation 2, there is 1 positive case and 800 negative cases. (The numbers match your example.) And again, there is zero possibility of further subdividing the subpopulations.

Then we will get constant predicted probabilities of belonging to the positive class: $p_1$ for subpopulation 1 and $p_2$ for subpopulation 2. The Brier score then is

$$ \frac{1}{5+1000}\big(4(1-p_1)^2+200p_1^2+1(1-p_2)^2+800p_2^2\big). $$

Using a little calculus, we find that this is optimized by

$$ p_1 = \frac{1}{51} \quad\text{and}\quad p_2=\frac{1}{801}, $$

which are precisely the proportions of positive cases in the two subpopulations. This in turn is as it should be, because this is what the Brier score being proper means.
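
(For completeness, the calculus is just setting the derivatives with respect to $p_1$ and $p_2$ to zero; the common factor $\frac{1}{1005}$ does not affect the minimizer:

$$ -8(1-p_1)+400p_1=0 \;\Rightarrow\; p_1=\frac{8}{408}=\frac{1}{51}, \qquad -2(1-p_2)+1600p_2=0 \;\Rightarrow\; p_2=\frac{2}{1602}=\frac{1}{801}. $$
)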

And there you have it. The Brier score, being proper, will be optimized by the true class membership probabilities. If you have predictors that allow you to identify subpopulations or instances with a higher true probability, then the Brier score will incentivize you to output these higher probabilities. Conversely, if you can't identify such subpopulations, then the Brier score can't help you - but neither can anything else, simply because the information is not there.

However, the Brier score will not reward you for overestimating the probability in subpopulation 1 or underestimating it in subpopulation 2 beyond the true values $p_1=\frac{1}{51}$ and $p_2=\frac{1}{801}$, e.g., because "there are more positive cases in subpopulation 1 than in 2". Yes, that is so, but what use would over- or underestimating these values be? We already know about the differential from the difference between $p_1$ and $p_2$, and biasing these estimates will not serve us at all.
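
To check this numerically, here is a small sketch (the helper function and the probe values are mine, simply plugging constants into the expression above):

```python
def brier_score(p1, p2):
    """Brier score for constant within-subpopulation predictions p1, p2
    in the 4/200 vs. 1/800 example above."""
    return (4 * (1 - p1) ** 2 + 200 * p1 ** 2
            + (1 - p2) ** 2 + 800 * p2 ** 2) / 1005

print(brier_score(1 / 51, 1 / 801))  # optimum: the true subpopulation rates (~0.00490)
print(brier_score(0.10, 1 / 801))    # overestimate p1 -> worse (~0.00621)
print(brier_score(1 / 51, 0.0))      # underestimate p2 -> slightly worse
print(brier_score(0.5, 0.0))         # exaggerate the contrast -> much worse (~0.0517)
```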

In particular, there is nothing an ROC analysis can help you with beyond finding an "optimal" threshold (which I pontificate on here). And finally, there is nothing in this analysis that depends in any way on classes being balanced or not, so I argue that unbalanced datasets are not a problem.

Finally, this is why I don't see the two answers you propose as useful. The Brier score helps us get at true class membership probabilities. What we then do with these probabilities will depend on our cost structure, and per my post on thresholds above, that is a separate problem. Yes, depending on this cost structure, we may end up with an algebraically reformulated version of a stratified Brier score, but keeping the statistical and the decision theoretic aspect separate keeps the process much cleaner.

Stephan Kolassa
  • Any idea why we don't call it just Mean Square Error (MSE)? – Azim Jun 24 '21 at 03:26
  • @Azim: yes, the Brier score is just the MSE of the predicted probabilities. As to why it's not called that, it's just nomenclature that gained traction in a scientific discipline more accustomed to thinking in terms of scores rather than errors. I don't think there is more to it, and I don't think there would be much of a point in trying to change the consensus terminology... – Stephan Kolassa Jun 24 '21 at 05:50
  • @StephanKolassa I have been following your analysis and commentary on the imbalanced data issue with great interest. I am seeing arguments that Brier score has undesirable properties in some applications (https://arxiv.org/pdf/physics/0401046v1.pdf) but I'm guessing we could use another proper scoring rule? – Paul Nov 23 '22 at 23:15
  • @Paul: thanks for that paper, I'll take a look. In the meantime, we do have a thread that compares the log loss to the Brier score: Why is LogLoss preferred over other proper scoring rules? I may include that paper in the thread if I find a new argument in there. – Stephan Kolassa Nov 24 '22 at 08:27
  • @Paul: it turns out that I had already read that paper in 2020, and that it is actually included in that other thread. Jewson's argument is essentially that the Brier score does not penalize strong probabilistic underforecasts enough, most blatantly if we predict a zero probability for an event that does occur. He therefore recommends the log score. I concur completely. However, the log score will explode in this situation. Some people consider this a bug. I (and presumably Jewson) consider it a feature. – Stephan Kolassa Nov 24 '22 at 10:06
  • @Paul: you may also be interested in the scoring rules tag wiki, in particular in the paper by Merkle & Steyvers (2013) cited there. – Stephan Kolassa Nov 24 '22 at 10:07
  • I believe an issue arises when there is a very unbalanced dataset. For example, a data sample of 10,000 has 10 positive examples. In this case a model that predicts everything as negative gets a score of 0.001, while a model that gets 80% of positive cases and 80% of negative cases right gets a score of 0.2. In this case the stratified Brier score makes sense. – Akavall Nov 30 '22 at 03:46
  • @Akavall: the Brier score evaluates probabilistic predictions. It will correctly prefer a predicted $\hat{p}=0.001$ in your case over any other prediction (in expectation). Of course, it will also wrongly prefer $\hat{p}=0$ over any $\hat{p}>.002$, but the gradient pulls us toward the correct $\hat{p}=0.001$. It's also a reason to prefer the log score. We can only discuss "getting cases correct" if we specify a threshold, which I would argue conflates the prediction with the decision aspect. – Stephan Kolassa Nov 30 '22 at 12:11
  • @StephanKolassa, I am talking about using the Brier score to evaluate a model after it has been trained (not using it as a loss function, or maybe I misunderstand your comment about the gradient). And in this case my point remains: for a very highly unbalanced problem, a model that completely ignores the minority class will have a good Brier score. The reason for this is that getting a prediction completely wrong only costs 1 (as opposed to infinity in the log loss). – Akavall Dec 01 '22 at 00:15
  • @Akavall: if you are not talking about the Brier score to evaluate a model after it was fitted, then I have to admit I don't quite understand what you are discussing. My point also holds for using the Brier score as a loss function during training. I think I addressed your imbalance example in my previous comment: of course, completely ignoring the minority class (i.e., $\hat{p}=0$) will lead to a lower Brier score than a high overprediction, but a calibrated prediction will be better than either. And log vs. Brier score is addressed in the first link in my comment. – Stephan Kolassa Dec 03 '22 at 08:38
  • @StephanKolassa, yes, I am indeed talking about using the Brier score to evaluate a model after it was fitted. But isn't the fact that models that assign the same low probability (0 or 0.001) to all cases score well according to the Brier score a shortcoming of the Brier score? Such models would be of limited practical use, no? – Akavall Dec 04 '22 at 03:29
  • @Akavall: yes, I completely agree. Which is why I believe the log score is superior, although there are arguments for the Brier over the log score. See this thread I linked to above. – Stephan Kolassa Dec 04 '22 at 05:27

The paper "Class Probability Estimates are Unreliable for Imbalanced Data (and How to Fix Them)" (Wallace & Dahabreh 2012) argues that the Brier score as is fails to account for poor calibrations in minority classes. They propose a stratified Brier score:

$$BS^+ = \frac{\sum_{y_i=1}\left(y_i- \hat{P}\left\{y_i|x_i\right\}\right)^2}{N_{pos}}$$ $$BS^- = \frac{\sum_{y_i=0}\left(y_i- \hat{P}\left\{y_i|x_i\right\}\right)^2}{N_{neg}}$$

Unfortunately, this does not give you a single metric to optimize, but you could take the maximum of the stratified Brier scores for your model and base your decision on the worst performance across the classes.
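
A minimal sketch in code (my own helper, interpreting $\hat{P}\{y_i|x_i\}$ as the predicted probability of the positive class, which is the usual binary Brier convention):

```python
import numpy as np

def stratified_brier(y_true, p_hat):
    """BS+ and BS- as above; p_hat is the predicted probability of class 1."""
    y_true, p_hat = np.asarray(y_true, float), np.asarray(p_hat, float)
    bs_pos = np.mean((y_true[y_true == 1] - p_hat[y_true == 1]) ** 2)  # over positives
    bs_neg = np.mean((y_true[y_true == 0] - p_hat[y_true == 0]) ** 2)  # over negatives
    return bs_pos, bs_neg

# Model selection on the worst class, as suggested above (toy numbers):
bs_pos, bs_neg = stratified_brier([1, 1, 0, 0, 0, 0], [0.7, 0.2, 0.1, 0.0, 0.3, 0.05])
worst = max(bs_pos, bs_neg)
```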

As an aside, the authors point out that the probability estimates obtained using Platt scaling are woefully inaccurate for the minority class as well. To remedy this, they propose some combination of undersampling and bagging.
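
A rough sketch of the undersample-and-bag idea (my own minimal version, not the paper's exact procedure; the function name and parameters are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bagged_undersampled_probs(X, y, X_new, n_bags=50, random_state=0):
    """Each bag keeps all positives plus an equally sized random subset of
    negatives, fits a base model, and the predicted probabilities of the
    positive class are averaged across bags."""
    rng = np.random.default_rng(random_state)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    probs = np.zeros(len(X_new))
    for _ in range(n_bags):
        neg_sample = rng.choice(neg_idx, size=len(pos_idx), replace=False)
        idx = np.concatenate([pos_idx, neg_sample])
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        probs += model.predict_proba(X_new)[:, 1]
    return probs / n_bags
```

Note that undersampling shifts the effective prevalence seen by each base model, so the averaged probabilities reflect the roughly balanced prior rather than the original one and may need recalibration to the original base rate.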

MCR

If there is extreme class imbalance (e.g., 5 positive cases vs. 1,000 negative cases), how does the Brier score ensure that we select the model that gives us the best performance in terms of high predicted probabilities for the 5 positive cases? After all, we do not care whether the negative cases have predictions near 0 or 0.5, as long as they are relatively lower than those for the positive cases.

It doesn't ensure that; see my counterexample here:

Why is accuracy not the best measure for assessing classification models?

That doesn't mean the Brier score isn't a good idea, just that it is no panacea (because it doesn't take into account the purpose of the analysis and simply measures the quality of the probability estimates everywhere, weighted by the data density).

Dikran Marsupial