
When the outcome of a supervised learning problem is binary and probabilities are predicted, the Brier score can be decomposed into a measure of calibration and a measure of discrimination.

Computationally, the Brier score is just a mean squared error, a common performance metric in regression. Is there a decomposition of the mean squared error into calibration and discrimination that makes sense for a numerical (not categorical) outcome?

For what it's worth, when I shoehorn such a situation into some Python code, the function runs.

import numpy as np
from model_diagnostics.scoring import SquaredError, decompose

np.random.seed(2023)
N = 1000
x = np.random.beta(1, 1, N)
e1 = np.random.normal(0, 1, N)  # the same noise at three decreasing scales
e2 = e1 / 10
e3 = e1 / 100

# squared-error decomposition of x against x + e_i for each noise level
decompose(x + e1, x, scoring_function=SquaredError())
decompose(x + e2, x, scoring_function=SquaredError())
decompose(x + e3, x, scoring_function=SquaredError())
Dave

2 Answers


Yes, it does. As explained in my answer to this question, we can understand Murphy's decomposition $$ B(p) = \mathrm{REL} - \mathrm{RES} + \mathrm{UNC} $$ of the Brier score $B$ of the predictions $p$ by defining two additional forecasts. We let $\bar o$ be the marginal (base-rate) forecast and $q$ the re-calibrated version of $p$, and obtain \begin{align*} \mathrm{REL} &= B(p) - B(q) \\ \mathrm{RES} &= B(\bar o) - B(q) \\ \mathrm{UNC} &= B(\bar o). \end{align*} For the squared error loss $S(x, y) = (x - y)^2$ (the same works, in principle, for any loss function) we can apply the same idea to obtain a decomposition of the mean squared error. Let $X_t$ be a forecast for the mean of $Y_t$. Then we can define

  • $\mathbb{E} Y_t$ as a marginal forecast.
  • $\mathbb{E}(Y_t \mid X_t)$ as a re-calibrated version of the forecast $X_t$. This makes sense because if the forecast were reliable (calibrated), then on all events where $X_t = x$ holds we would believe the expectation of $Y_t$ to be $x$, i.e. $\mathbb{E}(Y_t \mid X_t = x) = x$. In practice, we would estimate this value via some sort of regression with $Y_t$ as the dependent variable.

With these two quantities we can then define \begin{align*} \mathrm{REL} &= \mathbb{E}S(X_t, Y_t) - \mathbb{E}S(\mathbb{E}(Y_t \mid X_t), Y_t) \\ \mathrm{RES} &= \mathbb{E}S(\mathbb{E}Y_t, Y_t) - \mathbb{E}S(\mathbb{E}(Y_t \mid X_t), Y_t) \\ \mathrm{UNC} &= \mathbb{E}S(\mathbb{E}Y_t, Y_t) = \mathrm{Var}(Y_t) \end{align*} and get a calibration/reliability-resolution decomposition of the mean squared error. (Whether to call it calibration or reliability is a matter of taste, I suppose; the resolution is sometimes also called 'discrimination'.) Due to the properties of the squared error, the reliability and resolution terms simplify to \begin{align*} \mathrm{REL} &= \mathbb{E}\bigl(\mathbb{E}(Y_t \mid X_t) - X_t\bigr)^2 \\ \mathrm{RES} &= \mathrm{Var}\bigl(\mathbb{E}(Y_t \mid X_t)\bigr), \end{align*} see the references for details.

A key difference is that Murphy's decomposition was developed, and is usually presented, in terms of realized forecasts and observations, whereas this decomposition is defined in terms of population values, i.e. it is a property of an (idealized) data- and forecast-generating process.
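To make these population quantities concrete, here is a minimal sketch (my own illustration, not the estimator used by model-diagnostics) that approximates $\mathbb{E}(Y_t \mid X_t)$ by averaging the outcomes within quantile bins of the forecast and then checks that $\mathrm{REL} - \mathrm{RES} + \mathrm{UNC}$ roughly reproduces the mean squared error:

import numpy as np

# simulated forecast/outcome pairs; the forecast x is deliberately miscalibrated,
# since the true conditional mean is 0.1 + 0.8 * x rather than x itself
rng = np.random.default_rng(1)
n = 200_000
x = rng.beta(2, 2, n)                       # forecasts X_t
y = 0.1 + 0.8 * x + rng.normal(0, 0.1, n)   # outcomes Y_t

# crude estimate of E(Y | X): average y within 20 quantile bins of x
edges = np.quantile(x, np.linspace(0, 1, 21))
bin_id = np.clip(np.digitize(x, edges[1:-1]), 0, 19)
bin_means = np.array([y[bin_id == k].mean() for k in range(20)])
ey_given_x = bin_means[bin_id]

mse = np.mean((x - y) ** 2)           # E S(X_t, Y_t)
rel = np.mean((ey_given_x - x) ** 2)  # reliability / calibration
res = np.var(ey_given_x)              # resolution / discrimination
unc = np.var(y)                       # uncertainty = Var(Y_t)
print(mse, rel - res + unc)           # approximately equal, up to the binning error

The binning is only a stand-in for a proper regression estimate of $\mathbb{E}(Y_t \mid X_t)$; any reasonable smoother could take its place.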

The Python package model-diagnostics that you used in your code in fact estimates the population values of this reliability-resolution decomposition from the data you put in. The tricky part is then to find a good estimator, in particular of the re-calibrated forecast $\mathbb{E}(Y_t \mid X_t)$.

References

  1. I find "The Murphy Decomposition and the Calibration-Resolution Principle: A New Perspective on Forecast Evaluation" by Pohle readable. That paper also uses 'calibration' instead of 'reliability'.

  2. The documentation of the model-diagnostics function decompose has some further references on the kind of decomposition it computes.

  • For the mean squared error (in principle for any loss function) $S(x,y) = (x - y)^2$: there is no mean in $S$, or is there? – Richard Hardy Dec 03 '23 at 16:27
  • @RichardHardy No, there is not. The decomposition is for the mean squared error, but $S$ is just the squared error, of course. – picky_porpoise Dec 04 '23 at 20:13
  • OK. Because the sentence reads as if mean squared error were a loss function, while it is not (squared error is, but not mean squared error), and as if $S$ were mean squared error. – Richard Hardy Dec 08 '23 at 11:36

The Brier score is for the case where the “predicted” value is reduced to a single number, ideally computed from a different dataset than the one on which predictive accuracy is being assessed (alternatively, one can bootstrap the process to de-bias the Brier score, as done here). An analogy for linear models would be fitting a model to new data where the estimated linear predictor $X\hat{\beta}$ is allowed to be recalibrated. If the recalibration is linear, the new model's linear predictor is $\alpha + \tau X\hat{\beta}$. One can compute the sum of squared errors with and without forcing $\alpha=0, \tau=1$ to get a decomposition of $\hat{\sigma}^2$ that assesses (linear) calibration.
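As a rough numerical sketch of that last step (my own illustration, not code from the answer), one can estimate the linear recalibration $\alpha + \tau X\hat{\beta}$ on the new data by ordinary least squares and compare the sums of squared errors with and without forcing $\alpha = 0, \tau = 1$:

import numpy as np

rng = np.random.default_rng(0)
n = 500
x_new = rng.normal(size=n)
lp = 1.0 + 2.0 * x_new                        # linear predictor X beta-hat from an earlier fit
y_new = 0.5 + 1.5 * lp + rng.normal(0, 1, n)  # new data on which that predictor is miscalibrated

tau, alpha = np.polyfit(lp, y_new, 1)         # OLS recalibration: y ~ alpha + tau * lp
sse_fixed = np.sum((y_new - lp) ** 2)                  # forcing alpha = 0, tau = 1
sse_recal = np.sum((y_new - (alpha + tau * lp)) ** 2)  # after linear recalibration
print(sse_fixed, sse_recal, sse_fixed - sse_recal)     # the gap reflects (linear) miscalibration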

Frank Harrell