
Wikipedia has an article about the Brier score whose notation confuses me.

The article starts out easy enough by defining the Brier score to be:

$$ BS = \dfrac{1}{N}\overset{N}{\underset{t = 1}{\sum}}\left( f_t - o_t\right)^2 $$

$N$ is the sample size.

$o_t$ is the true outcome for instance $t$, either $0$ or $1$.

$f_t$ is the forecast probability for instance $t$, i.e., the predicted probability of an outcome in category $1$.

This matches the usual way I think of the Brier score: as a mean squared error.
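(For concreteness, the definition above takes only a few lines of R; `brier_score` is my own helper name, not from the article.)

```r
# Brier score as the mean squared difference between
# forecast probabilities f and binary outcomes o
brier_score <- function(f, o) mean((f - o)^2)

f <- c(0.9, 0.2, 0.6)  # forecast probabilities
o <- c(1, 0, 1)        # observed outcomes
brier_score(f, o)      # (0.01 + 0.04 + 0.16)/3 = 0.07
```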

Where the article loses me is when it decomposes the Brier score into reliability/resolution/uncertainty and calibration/refinement. The notation is as follows, with a remark about an abuse of the equals sign.

$$ BS = REL - RES + UNC = \left[\dfrac{1}{N}\overset{K}{\underset{k = 1}{\sum}}n_k\left(f_k - \bar o_k\right)^2\right] +\left[ -\dfrac{1}{N}\overset{K}{\underset{k = 1}{\sum}}n_k\left(\bar o_k - \bar o\right)^2 \right] + \left[ \bar o\left(1 - \bar o\right) \right] $$

$$ BS = CAL + REF = \left[\dfrac{1}{N}\overset{K}{\underset{k = 1}{\sum}}n_k\left(f_k - \bar o_k\right)^2\right] + \left[ \dfrac{1}{N}\overset{K}{\underset{k = 1}{\sum}}n_k\left(\bar o_k\left(1 - \bar o_k\right)\right) \right] $$

The notation also puts these in bold, which I think is to handle the case of having more than two outcome categories (addressed earlier in the article).

Could someone unpack this notation to give the explicit equations for reliability, resolution, uncertainty, refinement, and calibration, valid for both binary and multi-class outcomes?

This goes through the intuition but not the explicit computations that I seek here.

EDIT

I have gone through the calculations proposed in an answer by picky_porpoise, and they do not match up with what I should be getting.

library(SpecsVerification)
set.seed(2023)
N <- 100
# p <- rep(seq(0, 1, 1/50), 3)
# y <- rbinom(length(p), 1, p)

p <- rbeta(N, 1, 1)
y <- rbinom(N, 1, p)

res <- function(y, p){

  # Determine the unique values in the p vector
  #
  u <- unique(p)

  # Loop over the unique values
  #
  summands <- rep(NA, length(u))  # Hold the loop results
  for (i in 1:length(u)){

    # Calculate the o_k
    #
    o_k <- sum(y[p == u[i]])

    # Calculate the n_k
    #
    n_k <- length(which(p == u[i]))

    # Calculate the o-bar
    #
    o_bar <- mean(y)

    # Put the pieces together
    #
    summands[i] <- n_k * ((o_k/n_k) - o_bar)^2

  }

  # Add up the summands
  #
  my_sum <- sum(summands)

  # Divide the sum by the sample size
  #
  return(my_sum/length(y))
}

rel <- function(y, p){

  # Determine the unique values in the p vector
  #
  u <- unique(p)

  # Loop over the unique values
  #
  summands <- rep(NA, length(u))  # Hold the loop results
  for (i in 1:length(u)){

    # Calculate the o_k
    #
    o_k <- sum(y[p == u[i]])

    # Calculate the n_k
    #
    n_k <- length(which(p == u[i]))

    # Calculate the p_k
    #
    p_k <- u[i]

    # Put the pieces together
    #
    summands[i] <- n_k * ((o_k/n_k) - p_k)^2

  }

  # Add up the summands
  #
  my_sum <- sum(summands)

  # Divide the sum by the sample size
  #
  return(my_sum/length(y))
}

unc <- function(y, p){ return(mean(y) * (1 - mean(y))) }

rel(y, p) # 0.1881283
res(y, p) # 0.2499
unc(y, p) # 0.2499
SpecsVerification::BrierDecomp(p, y)
# 0.011184570, 0.07068499, 0.2499000

rel(y, p) - res(y, p) + unc(y, p) - mean((y - p)^2)
# Zero difference between the Brier score decomposition
# calculation and the Brier score!

Dave
  • Did you read the cited article that constructed this decomposition? https://doi.org/10.1175%2F1520-0450%281973%29012%3C0595%3AANVPOT%3E2.0.CO%3B2 – Yashaswi Mohanty Nov 15 '23 at 05:36

1 Answer


The appendix of the paper Simplifying and generalising Murphy's Brier score decomposition by Stefan Siegert provides a simpler proof than the original work by Murphy. Let me just give the intuition here. I will use Siegert's notation in order to make the transition from my summary to his proof as simple as possible.

The empirical Brier score for binary outcomes $y_1, \ldots, y_N$ and forecast probabilities $p_1, \ldots, p_N$ is given by $$ B(p) = \frac{1}{N} \sum_{t=1}^{N} (y_t - p_t)^2 . $$ The idea of the decomposition is to look at the performance of the forecasts for each possible value they took. To do this, the following notation is needed:

  • The set of $K$ distinct values which the forecast probabilities assume, denoted $\{P_1, \ldots, P_K\}$.
  • The number $n_k$, for each $k=1, \ldots, K$, which states how often the forecast took the value $P_k$.
  • The number $o_k$, for each $k=1, \ldots, K$, which states how often success was observed, i.e. $y_t=1$, when the forecast took the value $P_k$.
  • The base rate $\bar o = \frac{1}{N} \sum_{t=1}^N y_t$.

With these we get $$ B(p) = \mathrm{REL} - \mathrm{RES} + \mathrm{UNC} $$ where \begin{align*} \mathrm{REL} &= \frac{1}{N} \sum_{k=1}^K n_k \left( \frac{o_k}{n_k} - P_k \right)^2 \\ \mathrm{RES} &= \frac{1}{N} \sum_{k=1}^K n_k \left( \frac{o_k}{n_k} - \bar o \right)^2 \\ \mathrm{UNC} &= \bar o (1 - \bar o) \end{align*} Now the key insight to understand why this decomposition holds is to define new 're-calibrated' probability forecasts $q_1, \ldots, q_N$ via $$ q_t = \frac{o_{k(t)}}{n_{k(t)}} $$ where $k(t)$ is the index $k$ such that $p_t = P_k$. The $q_t$ are calibrated in the sense that they always hit the exact event frequencies for the $K$ different forecast cases we consider. With this notation, the decomposition follows from calculating that \begin{align*} \mathrm{REL} &= B(p) - B(q) \\ \mathrm{RES} &= B(\bar o) - B(q) \\ \mathrm{UNC} &= B(\bar o) \end{align*} such that the new terms $B(q)$ and $B(\bar o)$ cancel out and only $B(p)$ remains. Apart from enabling an easy proof of the decomposition, these equations also provide an intuitive interpretation of the components:
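These identities are easy to verify numerically. A minimal R sketch (variable names mine), using `ave()` to compute the group frequencies $o_k/n_k$, with the forecasts restricted to a few distinct values so that $K < N$:

```r
set.seed(1)
# Forecasts restricted to three distinct values so that K < N
p <- sample(c(0.2, 0.5, 0.8), 50, replace = TRUE)
y <- rbinom(50, 1, p)

B <- function(f) mean((y - f)^2)   # empirical Brier score

# Re-calibrated forecasts q_t = o_k / n_k for each distinct value of p
q <- ave(y, p)                     # group means of y within levels of p
o_bar <- mean(y)

REL <- B(p) - B(q)
RES <- B(o_bar) - B(q)
UNC <- B(o_bar)                    # equals o_bar * (1 - o_bar)

all.equal(B(p), REL - RES + UNC)   # TRUE
```

Note that $B(\bar o) = \bar o(1-\bar o)$ holds exactly for binary outcomes, which is why UNC can be written either way.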

  • The reliability (REL) component is the difference in performance (measured via Brier score) between the forecast and a re-calibrated version of it, where calibration means that the values have been fit to the true event frequencies. It thus measures the performance loss due to getting the event frequencies wrong.
  • The resolution (RES) component is the difference in performance between the base rate forecast and the calibrated version of the forecast. It thus quantifies how well the forecast distinguishes events from non-events, while setting aside the reliability question (which is already captured in the REL component).
  • The uncertainty (UNC) is the performance of a simple constant base rate forecast. The forecast improves on this by increasing resolution while keeping the reliability component low, i.e. being reliable.

Although intuitive, this type of score decomposition has the disadvantage that the 're-calibration' is trivial. In particular, if all forecast probabilities are distinct (so that $K=N$), which is realistic in many applications, the decomposition becomes useless: $q_t = y_t$, so REL reduces to the Brier score itself and RES equals UNC. An improved decomposition, which should probably be preferred, is derived in Stable reliability diagrams for probabilistic classifiers by Dimitriadis et al.
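To illustrate the difference, here is a rough sketch of the same $B(p) - B(q)$ style of decomposition with the trivial group-mean recalibration replaced by an isotonic (monotone) fit via `stats::isoreg`. This follows the spirit of the Dimitriadis et al. approach, not their actual implementation, which should be consulted for real use:

```r
set.seed(1)
N <- 100
p <- rbeta(N, 1, 1)       # all forecast values distinct: K = N
y <- rbinom(N, 1, p)

B <- function(f) mean((y - f)^2)

# Isotonic (PAV) recalibration: q is the best non-decreasing fit of y on p
ord <- order(p)
fit <- isoreg(p[ord], y[ord])
q <- numeric(N)
q[ord] <- fit$yf          # fit$yf is ordered by sorted p; undo the sort

o_bar <- mean(y)
REL <- B(p) - B(q)        # miscalibration
RES <- B(o_bar) - B(q)    # discrimination
UNC <- B(o_bar)

all.equal(B(p), REL - RES + UNC)   # TRUE by construction
```

Unlike the per-value grouping, the isotonic fit stays informative even when every $p_t$ is distinct.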

Edit

  • The decomposition naturally extends to the multi-class setting. The $p_t$ are then probability vectors over the classes and $K$ is the number of distinct probability vectors. The number of successes and the base rate have to be computed per class.

  • The function BrierDecomp from the package SpecsVerification does not yield the decomposition described here. The reason for this is that they calculate re-calibrated forecasts via binning.
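Concretely, the multi-class extension can be sketched with one-hot outcome and forecast matrices, summing squared errors over classes. This is my own sketch, not taken from the paper; for simplicity the distinct forecast vectors are known by construction (via `idx`), whereas in general one would group on the rows of `p`:

```r
set.seed(1)
N <- 60; J <- 3
# Forecast probability vectors drawn from a few fixed rows so that K < N
P <- matrix(c(0.7, 0.2, 0.1,
              0.1, 0.6, 0.3,
              0.2, 0.3, 0.5), nrow = 3, byrow = TRUE)
idx <- sample(1:3, N, replace = TRUE)
p <- P[idx, ]                           # N x J forecast matrix
cls <- sapply(1:N, function(t) sample(1:J, 1, prob = p[t, ]))
y <- diag(J)[cls, ]                     # N x J one-hot outcome matrix

# Multi-class Brier score: squared errors summed over classes
B <- function(f) mean(rowSums((y - f)^2))

# Recalibrate per distinct forecast vector, taking the observed
# class frequencies within each group
q <- apply(y, 2, function(col) ave(col, idx))
o_bar <- matrix(colMeans(y), N, J, byrow = TRUE)  # per-class base rates

REL <- B(p) - B(q)
RES <- B(o_bar) - B(q)
UNC <- B(o_bar)
all.equal(B(p), REL - RES + UNC)        # TRUE
```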

  • In most situations, $K$ will be equal to the number of predictions made, correct? (I guess I expect all of the predictions to be at least a little bit different.) – Dave Nov 16 '23 at 20:14
  • @Dave Probably yes, if there is no restriction on how precise you can make your probabilities, there will be no ties, so $K=N$. – picky_porpoise Nov 16 '23 at 20:19
  • I've tried to implement your equations (edited into my question) but have only gotten the uncertainty to work. Are there errors in your equations? Have a made a coding error? – Dave Nov 16 '23 at 21:10
  • @Dave I think when you compute the $o_k$ you have to use sum instead of mean . Otherwise the sum of successes is divided by $n_k$ two times. – picky_porpoise Nov 17 '23 at 11:33
  • So Wikipedia’s $f_k$ is what you call $P_k$, and $\bar o_k$ is what you call $o_k/n_k$, right? – Dave Nov 17 '23 at 17:34
  • @Dave I'd rather call it Siegert's notation, not mine, but yes. – picky_porpoise Nov 18 '23 at 07:17
  • I've included a new set of code. You were right in the first simulation that changing mean to sum did the trick, but when I try a new set of probabilities and outcomes, I get strange results. Thoughts? – Dave Nov 20 '23 at 14:50
  • @Dave It looks like you hit the $K=N$ case in your example. In this case RES = UNC and REL is just the Brier score, so the decomposition gives no insights. – picky_porpoise Nov 20 '23 at 17:00
  • The code I wrote, which sure seems to be what your equations say to do, does not agree with the numbers given by the package. – Dave Nov 20 '23 at 17:01
  • @Dave I'm pretty sure the equations represent the score decomposition proposed by Murphy and discussed in the Wikipedia article. It seems the package produces numbers which do not sum up to the Brier score. Probably it doesn't implement the same decomposition then. – picky_porpoise Nov 20 '23 at 18:03
  • The package documentation specifies that it deals with the 1973 Murphy decomposition (with an alternative if you override the Murphy73 default). Do you have an alternative package you would use to do these calculations? I am happy to work in either R or Python. – Dave Nov 20 '23 at 18:19
  • @Dave The documentation also specifies that they compute the $q_t$ via binning, so it is not the same decomposition anymore. This is of course to avoid the $K=N$ case. Your package question is difficult. I would not use this decomposition at all. But my second reference comes with an R package, I think. – picky_porpoise Nov 21 '23 at 08:14
  • I'd really like to see my code give the same answer as a vetted package, perhaps sklearn. Nonetheless, I can get on board with the idea that the package I used does the calculation a bit different, accounting for the different result. $//$ Now how does it work for multi-class problems? (I suspect it would be the same for multi-label.) – Dave Nov 22 '23 at 20:02
  • @Dave The decomposition works essentially the same in the multi-class situation and I added that to my answer. Spelling out all the details is too cumbersome, I'm afraid. – picky_porpoise Nov 25 '23 at 17:00