Wikipedia has an article about the Brier score whose notation confuses me.
The article starts out easy enough by defining the Brier score to be:
$$ BS = \dfrac{1}{N}\overset{N}{\underset{i = 1}{\sum}}\left( f_t - o_t\right)^2 $$
$N$ is the sample size.
$o_t$ is true value $t$, either $0$ or $1$.
$f_t$ is probability forecast $t$ giving the predicted probability of an outcome of category $1$.
This is the usual way I would think of Brier score as mean squared error.
Where the article loses me is when it gets into decomposing the Brier score into reliability/resolution/uncertainty and calibration/refinement. The notation is as follows, remarking about an abuse of the equals sign.
$$ BS = REL - RES + UNC = \left[\dfrac{1}{N}\overset{K}{\underset{K = 1}{\sum}}n_k\left(f_k - \bar o_k\right)^2\right] +\left[ -\dfrac{1}{N}\overset{K}{\underset{K = 1}{\sum}}n_k\left(\bar o_k - \bar o\right)^2 \right] + \left[ \bar o\left(1 - \bar o\right) \right] $$$$ BS = CAL + REF = \left[\dfrac{1}{N}\overset{K}{\underset{K = 1}{\sum}}n_k\left(f_k - \bar o_k\right)^2\right] + \left[ \dfrac{1}{N}\overset{K}{\underset{K = 1}{\sum}}n_k\left(\bar o_k\left(1 - \bar o_k\right)\right) \right] $$
The notation also puts these in bold, which I think it to handle the case of having more than just two outcome categories (addressed earlier in the article).
Could someone unpack this notation to give the explicit equations for reliability, resolution, uncertainty, refinement, and calibration, valid for both binary and multi-class outcomes?
This goes through the intuition but not the explicit computations that I seek here.
EDIT
I have gone through the calculations proposed in an answer by picky_porpoise, and they do not match up with what I should be getting.
library(SpecsVerification)
set.seed(2023)
N <- 100
# p <- rep(seq(0, 1, 1/50), 3)
# y <- rbinom(length(p), 1, p)
p <- rbeta(N, 1, 1)
y <- rbinom(N, 1, p)
res <- function(y, p){
Determine the unique values in the p vector
u <- unique(p)
Loop over the unique values
summands <- rep(NA, length(u)) # Hold the loop results
for (i in 1:length(u)){
# Calculate the o_k
#
o_k <- sum(y[p == u[i]])
# Calculate the n_k
#
n_k <- length(which(p == u[i]))
# Calculate the o-bar
#
o_bar <- mean(y)
# Put the pieces together
#
summands[i] <- n_k * ((o_k/n_k) - o_bar)^2
}
Add up the summands
my_sum <- sum(summands)
Divide the sum by the sample size
return(my_sum/length(y))
}
rel <- function(y, p){
Determine the unique values in the p vector
u <- unique(p)
Loop over the unique values
summands <- rep(NA, length(u)) # Hold the loop results
for (i in 1:length(u)){
# Calculate the o_k
#
o_k <- sum(y[p == u[i]])
# Calculate the n_k
#
n_k <- length(which(p == u[i]))
# Calculate the p_k
#
p_k <- u[i]
# Put the pieces together
#
summands[i] <- n_k * ((o_k/n_k) - p_k)^2
}
Add up the summands
my_sum <- sum(summands)
Divide the sum by the sample size
return(my_sum/length(y))
}
unc <- function(y, p){
return(mean(y) * (1 - mean(y)))
}
rel(y, p) # 0.1881283
res(y, p) # 0.2499
unc(y, p) # 0.2499
SpecsVerification::BrierDecomp(p, y) # 0.011184570, 0.07068499, 0.2499000
rel(y, p) -
res(y, p) +
unc(y, p) -
mean((y - p)^2) # Zero difference between the Brier score decomposition
# calculation and the Brier score!