When the outcome of a supervised learning problem is binary and the model predicts probabilities, the Brier score can be decomposed into a measure of calibration and a measure of discrimination (plus an uncertainty term that depends only on the outcomes).
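To make the binary case concrete, here is a minimal sketch of that decomposition in the form usually attributed to Murphy: Brier score = reliability - resolution + uncertainty, where reliability is the calibration term and resolution is the discrimination term. The function name `brier_decomposition` and the simulated data are my own, for illustration only. Observations are grouped by distinct predicted probability, so the identity holds exactly in this setup.

```python
import numpy as np


def brier_decomposition(y, p):
    """Return (reliability, resolution, uncertainty) for binary y and probabilities p.

    Observations are grouped by distinct forecast value, which makes
    brier = reliability - resolution + uncertainty an exact identity.
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    base_rate = y.mean()

    # One group per distinct predicted probability.
    values, inverse, counts = np.unique(p, return_inverse=True, return_counts=True)
    # Observed event frequency within each group.
    obs_freq = np.bincount(inverse, weights=y) / counts
    weights = counts / y.size

    reliability = np.sum(weights * (values - obs_freq) ** 2)    # calibration term
    resolution = np.sum(weights * (obs_freq - base_rate) ** 2)  # discrimination term
    uncertainty = base_rate * (1.0 - base_rate)                 # depends on y only
    return reliability, resolution, uncertainty


# Simulated, well-calibrated forecasts on a small grid of probabilities.
rng = np.random.default_rng(0)
p = rng.choice([0.1, 0.3, 0.5, 0.7, 0.9], size=2000)
y = rng.binomial(1, p)

rel, res, unc = brier_decomposition(y, p)
brier = np.mean((p - y) ** 2)
print(brier, rel - res + unc)  # the two numbers agree up to floating-point error
```

With continuous predicted probabilities the grouping step would need binning (or some other recalibration step), and the identity then holds only approximately.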
Computationally, the Brier score is just mean squared error, a common performance metric in regression. Is there an analogous decomposition of mean squared error into calibration and discrimination that makes sense for a numerical (not categorical) outcome?
For what it's worth, when I shoehorn such a situation into some Python code, the function runs:
import numpy as np
from model_diagnostics.scoring import SquaredError, decompose

np.random.seed(2023)
N = 1000

# Beta(1, 1) is the uniform distribution on [0, 1].
x = np.random.beta(1, 1, N)

# Gaussian noise at three scales: sd 1, 0.1, and 0.01.
e1 = np.random.normal(0, 1, N)
e2 = e1 / 10
e3 = e1 / 100

# Decompose the squared error at each noise level.
decompose(x + e1, x, scoring_function=SquaredError())
decompose(x + e2, x, scoring_function=SquaredError())
decompose(x + e3, x, scoring_function=SquaredError())