
Consider a binary classification problem, where the goal is to use training data $(x_i,y_i)_{i=1}^n$ to fit a classifier $f: \mathbb{R}^d \rightarrow [0,1]$ that outputs a conditional probability estimate (e.g. $f$ could be a logistic regression model).

The general way to check whether the predicted probabilities match the true probabilities (i.e., are "well-calibrated") seems to be a reliability plot, which plots the predicted probabilities on the x-axis against the observed frequencies (the fraction of positives among examples with similar predictions) on the y-axis.
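For concreteness, a reliability plot can be produced along these lines (a minimal sketch assuming scikit-learn and matplotlib are available; the arrays below are synthetic placeholders, not my actual data):

```python
# Minimal sketch of a reliability plot (assumes scikit-learn and matplotlib).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)   # placeholder predicted probabilities
y_true = rng.binomial(1, y_prob)  # placeholder labels drawn from those probabilities

# Fraction of positives (y-axis) vs. mean predicted probability (x-axis), per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()
```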

I am looking for a performance metric that could be used instead of (or alongside) the reliability plot. Ideally, I'd like to find a metric that is already used in the statistics or ML literature.

Berk U.
  • I believe the metrics you're looking for are essentially the log loss and the Brier score? A few references (and a quick usage sketch below):
    1. https://arxiv.org/pdf/2002.06470.pdf
    2. https://link.springer.com/content/pdf/10.1007/s10115-013-0670-6.pdf
    3. https://arxiv.org/pdf/2112.12843.pdf
    – Eike P. Mar 10 '22 at 20:27
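A minimal sketch of computing both of those metrics, assuming scikit-learn is available (the arrays are synthetic placeholders):

```python
# Minimal sketch of the two metrics named above (assumes scikit-learn).
import numpy as np
from sklearn.metrics import log_loss, brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=1000)   # placeholder predicted P(y = 1)
y_true = rng.binomial(1, y_prob)  # placeholder labels consistent with those probabilities

print("log loss:   ", log_loss(y_true, y_prob))          # a.k.a. cross-entropy
print("Brier score:", brier_score_loss(y_true, y_prob))  # mean squared error of the probabilities
```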

1 Answer


I ended up finding several measures in the literature (see, e.g., CAL and MXE in the paper Data Mining in Metric Space: An Empirical Analysis of Supervised Learning Performance Criteria by Caruana and Niculescu-Mizil).

The most useful measure appears to be the Mean Calibration Error (CAL), which is the root-mean-squared error (RMSE) between the predicted probabilities and the binned observed probabilities on a calibration plot; since the sum runs over individual examples, each bin is implicitly weighted by the number of examples it contains. Formally:

$$\text{CAL} = \sqrt{\frac{1}{N}\sum_{k=1}^K \sum_{i \in B_k} (\bar{p}_k - \hat{p}_i)^2}$$

where:

  • $\hat{p}_i$ is the predicted probability for example $i = 1,\ldots,N$
  • $\bar{p}_k$ is the observed probability (the empirical frequency of positives) for examples in bin $B_k$, $k = 1,\ldots,K$

Here, the binning is required because we do not typically have a "true" probability for each example, only a label $y_i$. Thus, we construct $K$ bins (e.g., $B_1 = [0, 0.1)$, $B_2 = [0.1, 0.2)$, ...), assign each example to the bin containing its predicted probability $\hat{p}_i$, and estimate the observed probability for each bin as:

$$\bar{p}_k = \frac{1}{|B_k|}\sum_{i\in B_k} \mathbf{1}[y_i=1]$$
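For concreteness, here is a minimal sketch of that computation (the function name cal_score and the fixed-width binning are my own illustrative choices, not taken from the paper):

```python
# Illustrative implementation of the CAL definition above (not from the paper).
import numpy as np

def cal_score(y_true, y_prob, n_bins=10):
    """RMSE between each predicted probability and the observed frequency of its bin."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Assign each example to a fixed-width bin based on its predicted probability.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    sq_err = np.zeros_like(y_prob, dtype=float)
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            p_bar_k = y_true[mask].mean()              # observed frequency in bin k
            sq_err[mask] = (p_bar_k - y_prob[mask]) ** 2
    return np.sqrt(sq_err.mean())                      # RMSE over all N examples

# Well-calibrated synthetic predictions should give a small CAL.
rng = np.random.default_rng(0)
p = rng.uniform(size=5000)
y = rng.binomial(1, p)
print(cal_score(y, p, n_bins=10))
```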

CAL is an intuitive summary statistic, but it does have several shortcomings. In particular:

  • Since each bin's contribution to CAL is weighted by the number of observations it contains, CAL can hide local calibration problems on the reliability diagram. For instance, 95% of your observations could fall into the first bin ($\hat{p}_i \in [0, 0.05)$), where you predict well, while your predictions are completely off in the remaining cases; CAL would still look good.

  • CAL depends on the binning procedure, which is why some people use a smoothed estimate instead (e.g., Caruana and Niculescu-Mizil); the short sketch below illustrates this dependence. The issue does not arise in settings where classifiers output a discrete set of predicted probabilities (e.g., risk scores), since the predictions themselves define the bins.
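To illustrate the binning dependence, here is a small self-contained sketch (it repeats the illustrative cal_score function from the earlier snippet) that scores the same well-calibrated predictions under different bin counts:

```python
import numpy as np

def cal_score(y_true, y_prob, n_bins):
    # Same illustrative CAL as in the earlier sketch, repeated for self-containment.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    sq_err = np.zeros_like(y_prob, dtype=float)
    for k in range(n_bins):
        mask = bin_ids == k
        if mask.any():
            sq_err[mask] = (y_true[mask].mean() - y_prob[mask]) ** 2
    return np.sqrt(sq_err.mean())

rng = np.random.default_rng(1)
y_prob = rng.uniform(size=2000)
y_true = rng.binomial(1, y_prob)
for n_bins in (5, 10, 20, 50):
    # The reported calibration error changes with the (arbitrary) number of bins.
    print(n_bins, round(cal_score(y_true, y_prob, n_bins), 4))
```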

Berk U.
  • Wouldn't a model that just predicts the average probability for every data point get a near-perfect CAL score? – Bridgeburners Aug 01 '19 at 14:41
  • It does! I couldn't find a better summary statistic for calibration, but I did come up with some best practices that I'll add here. In short: (1) always report CAL & AUC -- models that predict near the average probability for all points typically don't rank well so in cases like these the model will have low AUC. (2) Don't use CAL as a model selection metric (e.g., choosing a model that optimizes K-CV CAL tends to favor models that perform badly). – Berk U. Aug 31 '19 at 21:57
  • Where did you take that definition of CAL? In the paper "Data Mining in Metric Space" it is defined using absolute differences instead of squares. – nikkou Sep 13 '19 at 10:02
  • Note that CAL and related calibration metrics are all sample-size-biased: metric values will tend to be larger on smaller samples, given an equally well-calibrated model. (This is especially problematic if comparing calibration between samples of different sizes.) See e.g. https://proceedings.mlr.press/v151/roelofs22a.html and https://arxiv.org/pdf/2302.08851.pdf. Disclaimer: I'm the first author on the latter one. We also propose a simple-to-use and unbiased calibration error metric. – Eike P. Sep 06 '23 at 21:55