Brier score decomposition relies on a foundational result in statistics called the partition of sums of squares. The basic idea is that given a sum of squares over any series, we can break that series into arbitrary groups, calculate the sum of squares within each group, calculate the sum of squares between the groups using only the group means, and the total sum of squares will equal the sum of the within-group sums of squares plus the between-group sum of squares.
$$ \text{Total SS} = \text{Within-Group SS} + \text{Between-Group SS} \tag{1} $$
Wikipedia has a proof of this theorem. There's a neat little cancellation of the cross term that makes everything work out, and that cancellation only occurs for sums of squares in particular. Conceptually, it's related to the fact that when we add independent random variables together, the variance of the sum is equal to the sum of the variances.
Although this theorem is often applied to variance and discussed in the context of linear regression, the theorem itself is simple algebra and can be used in a wide variety of contexts. We also have total freedom in how we break the series into groups. For example, for a clustering algorithm I might want to talk about "Within Group Variance" and "Between Group Variance". For ANOVA I might want to talk about the "proportion of variance explained." In each case, we simply decompose the total sum of squares into a within-group contribution and a between-group contribution and assign an interpretation to each part. The trick is to break the series into meaningful groups so that a meaningful interpretation can be attached to each one.
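To make the algebra concrete, here's a minimal sketch in plain NumPy (hypothetical data, not from the original discussion) that checks the identity in equation (1): the total sum of squares around the grand mean equals the within-group sum of squares plus the between-group sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical series split into three arbitrary groups.
groups = [rng.normal(loc=mu, scale=1.0, size=50) for mu in (0.0, 1.0, 3.0)]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total sum of squares around the grand mean.
total_ss = ((all_values - grand_mean) ** 2).sum()

# Within-group SS: deviations of each value from its own group's mean.
within_ss = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Between-group SS: group means vs. the grand mean, weighted by group size.
between_ss = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# The cross term cancels, so the identity holds up to floating point error.
assert np.isclose(total_ss, within_ss + between_ss)
```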
For Brier scores, it's assumed that we've broken our forecasts into groups so that we only have a small number of distinct forecasts. In practice this is almost always done with a bucketing strategy. For example, I might divide my forecasts into 5 groups: everything between 0 and 0.2 is treated as 0.1, everything between 0.2 and 0.4 is treated as 0.3, and so on. The number of buckets you can get away with depends on how much data you have, but in practice 5 is usually OK when sample sizes are small, and 10 is usually more than enough to calculate the calibration metrics. You can see scikit-learn's calibration_curve() for a practical example.

In fact, one interpretation of the calibration/reliability component of a Brier score decomposition is that it's simply the MSE of the calibration curve, weighted by the number of forecasts in each bucket.
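Here's a sketch of that interpretation (Python with hypothetical, deliberately miscalibrated forecasts; the bucket counts are recomputed with np.histogram since calibration_curve() doesn't return them, assuming uniform bins):

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)

# Hypothetical forecasts and outcomes: true probabilities plus some miscalibration.
p_true = rng.uniform(0, 1, size=2000)
y = rng.binomial(1, p_true)                     # observed 0/1 outcomes
forecasts = np.clip(p_true * 0.8 + 0.1, 0, 1)   # deliberately squeezed toward 0.5

n_bins = 5
prob_true, prob_pred = calibration_curve(y, forecasts, n_bins=n_bins, strategy="uniform")

# Bucket counts, assuming the same uniform bins calibration_curve uses.
counts, _ = np.histogram(forecasts, bins=np.linspace(0, 1, n_bins + 1))
counts = counts[counts > 0]  # calibration_curve drops empty buckets

# Reliability = count-weighted MSE between the calibration curve and the diagonal.
reliability = np.sum(counts * (prob_pred - prob_true) ** 2) / counts.sum()
print(f"reliability (calibration error): {reliability:.4f}")
```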
Once you've divided your forecasts up into a finite number of buckets, you can calculate the Brier score and its components:
$$ \underbrace{\frac{1}{N}\sum\limits_{t=1}^{N}(f_{t}-o_{t})^{2}}_{\text{Brier Score}}
= \underbrace{\frac{1}{N}\sum\limits_{k=1}^{K}n_{k}(f_{k}-\bar{o}_{k})^{2}}_{\text{Calibration/Reliability}}
\underbrace{
{} - \underbrace{\frac{1}{N}\sum\limits_{k=1}^{K}n_{k}(\bar{o}_{k}-\bar{o})^{2}}_{\text{Resolution}}
+ \underbrace{\bar{o}\left(1-\bar{o}\right)}_{\text{Uncertainty}}
}_{\text{Refinement}} \tag{2} $$
Here $N$ is the number of forecasts, $o_t \in \{0, 1\}$ is the outcome of forecast $t$, $K$ is the number of buckets, $n_k$ is the number of forecasts in bucket $k$, $f_k$ is the forecast value assigned to bucket $k$, $\bar{o}_k$ is the observed frequency of positives in bucket $k$, and $\bar{o}$ is the overall prevalence.
We start by breaking the Brier score into two components, calibration and refinement:
- The first term, called calibration, calibration error, or reliability, indicates how close the assigned probabilities are to the observed frequency of positives in each bucket. For example, for a bucket where the forecasted probability is 10%, we would want about 10% of the examples in that bucket to be positives. If 50% are positive, or only 2%, then our predictions were poorly calibrated.
- The second term, refinement, indicates how good the predictions are after controlling for poor calibration. In this way it's similar to AUC, in that a model with high predictive power will still get a low (lower is better) refinement score regardless of how poorly calibrated it is.
Sometimes it's useful to further decompose the refinement into two separate terms: resolution and uncertainty.
- Resolution indicates the forecaster's ability to go out on a limb and make non-trivial predictions. If every forecast is exactly equal to the prevalence, meaning the model is basically just spitting out the same constant value for every example, then resolution will be low. Only if the forecaster actually tries to adjust their forecast up or down based on knowledge of each case will they get a good resolution score.
- Uncertainty is the only term which does not include the forecast probability at all. It represents the inherent uncertainty of the actual outcomes. If outcomes are balanced and the overall prevalence is 50%, then uncertainty will be at its maximum of 0.25. If there is an imbalance between positives and negatives, uncertainty will be lower. For example, if only 10% of outcomes are positives, then uncertainty will be $0.1 (1 - 0.1) = 0.09$. You might notice this is exactly the same formula as the variance of a Bernoulli trial.
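Putting the pieces together, here's a minimal sketch in plain NumPy (hypothetical data; `brier_decomposition` is a helper introduced here for illustration, not a library function) that buckets forecasts, computes each term in equation (2), and checks that reliability minus resolution plus uncertainty adds back up to the Brier score.

```python
import numpy as np

def brier_decomposition(forecasts, outcomes, n_buckets=10):
    """Decompose the binary Brier score as in equation (2).

    Forecasts are bucketed uniformly and each forecast is replaced by its
    bucket's mean forecast, so brier == reliability - resolution + uncertainty
    holds exactly (up to floating point error).
    """
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    N = len(forecasts)

    o_bar = outcomes.mean()                      # overall prevalence
    uncertainty = o_bar * (1.0 - o_bar)          # variance of a Bernoulli trial

    # Assign each forecast to one of n_buckets uniform buckets on [0, 1].
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    bucket = np.clip(np.digitize(forecasts, edges[1:-1]), 0, n_buckets - 1)

    bucketed = np.empty_like(forecasts)
    reliability = resolution = 0.0
    for k in range(n_buckets):
        mask = bucket == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        f_k = forecasts[mask].mean()             # forecast value for this bucket
        o_k = outcomes[mask].mean()              # observed frequency in this bucket
        bucketed[mask] = f_k
        reliability += n_k * (f_k - o_k) ** 2
        resolution += n_k * (o_k - o_bar) ** 2
    reliability /= N
    resolution /= N

    brier = np.mean((bucketed - outcomes) ** 2)
    return brier, reliability, resolution, uncertainty


# Hypothetical forecasts/outcomes, same flavor as the calibration_curve sketch above.
rng = np.random.default_rng(0)
p_true = rng.uniform(0, 1, size=2000)
y = rng.binomial(1, p_true)
f = np.clip(p_true * 0.8 + 0.1, 0, 1)

bs, rel, res, unc = brier_decomposition(f, y, n_buckets=10)
print(f"Brier={bs:.4f}  reliability={rel:.4f}  resolution={res:.4f}  uncertainty={unc:.4f}")
assert np.isclose(bs, rel - res + unc)
```

Note that the identity is exact here because each forecast has been replaced by its bucket's value; with raw, un-bucketed forecasts the decomposition only holds approximately.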
The point of Brier score decomposition is to diagnose some of the common ways that a model or a forecaster can go wrong. If they're just straight-up terrible, they'll get a bad (high) Brier score overall. If they're pretty good overall but tend to be too conservative or aggressive, or if they incorrectly believe events that only happen 95% of the time are a "sure thing" or that events that happen 5% of the time "can never happen," or if they never put out a score less than 10% because "you never know" even though the actual rate for the bottom bucket is 1%, then they'll be poorly calibrated. If they play it safe and always make a forecast exactly at or near the overall prevalence, then they'll get a poor resolution score.
If you have a model which you believe to be poorly calibrated, you can sometimes do a post hoc correction to recalibrate the output. This is useful for models like neural networks, whose outputs aren't guaranteed to be well-calibrated probabilities, or models like naïve Bayes, which are known to push probabilities too far toward 0 or 1 when their independence assumptions are violated. You should be careful with this though, as you could simply be papering over a fundamental mismatch between your model and the actual data.
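As a sketch of what that post hoc correction can look like, here's one way to recalibrate a naïve Bayes model with scikit-learn's CalibratedClassifierCV (the synthetic dataset and parameters are purely illustrative assumptions, not from the original discussion):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Hypothetical data with redundant (correlated) features, which violates naive
# Bayes's independence assumption and tends to make its probabilities too extreme.
X, y = make_classification(n_samples=5000, n_features=20, n_informative=4,
                           n_redundant=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated naive Bayes baseline.
raw = GaussianNB().fit(X_train, y_train)

# Post hoc recalibration: learn an isotonic mapping from raw scores to
# probabilities via cross-validation on the training data.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_train, y_train)

print("Brier score, raw:         ", brier_score_loss(y_test, raw.predict_proba(X_test)[:, 1]))
print("Brier score, recalibrated:", brier_score_loss(y_test, calibrated.predict_proba(X_test)[:, 1]))
```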