
A 2020 NeurIPS paper by Gupta, Podkopaev & Ramdas addresses the calibration of outputs of binary “classification” models, noting that the raw scores, despite perhaps lying in $\left[0, 1\right]$, need not have literal interpretations as probabilities until they have been assessed for calibration and, if necessary, adjusted to reflect the reality of event occurrence (e.g., by Platt scaling).

Throughout the paper, there is reference to distribution assumptions and to how various calibration techniques are sensitive to those assumptions. As far as I can tell, in a binary “classification” model, the outcome is conditionally Bernoulli, end of discussion. Thus, it does not seem like there are any distributional assumptions to make. I would be on board with the idea of multiple possible distribution assumptions for an outcome that is more complex than binary (one for which all values on the continuum are possible, for instance, could be conditionally Gaussian, conditionally t-distributed, etc.). For the binary outcome, though, that seems bizarre. The outcome is conditionally Bernoulli, and that’s that.

What do Gupta, Podkopaev & Ramdas see differently? What distribution assumptions do they think could be made with a binary outcome?

Reference:

Gupta, Chirag, Aleksandr Podkopaev, and Aaditya Ramdas. "Distribution-free binary classification: prediction sets, confidence intervals and calibration." Advances in Neural Information Processing Systems 33 (2020): 3711-3723.

1 Answer

As far as I can tell, in a binary “classification” model, the outcome is conditionally Bernoulli, end of discussion. Thus, it does not seem like there are any distributional assumptions to make.

The binary model still needs to make assumptions about the relationship between the Bernoulli distribution parameter $p$ and the regressor/predictor variables, and this relationship can be misspecified.

The distribution under consideration is not just that of the binary class variable $Y$ but also that of the regressor/predictor variables $X$; the distributional assumptions concern not only $Y$ but the joint distribution of $(X, Y)$.

See for example the situation below, where a logistic regression without a quadratic term cannot correctly fit the true probability. The obtained fit $\hat{p}(X) = f(X)$ then need not be close to the true conditional probability $p(X) = P(\text{class} = 1 \mid X)$: as an estimate it is subject not only to statistical noise but also to bias.

The article proposes to estimate $\hat{p}$ separately from the logistic regression.

[Figure: example of a case where the logistic model would be a misspecification]
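To make this concrete, here is a minimal sketch (my own illustration, not the paper's code: the data-generating process, sample sizes, and number of bins are arbitrary choices). The true log-odds are quadratic in $x$, a logistic regression with only a linear term is fit, and its raw scores end up biased for $P(Y = 1 \mid X)$; a separate histogram-binning step on held-out data, in the spirit of estimating $\hat{p}$ separately from the regression, then gives approximately calibrated probabilities without fixing the misspecified model.

    # A minimal sketch, not code from the paper: sample sizes, coefficients,
    # and the number of bins are arbitrary illustrative choices.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def simulate(n):
        # True model: the log-odds of Y = 1 are quadratic in x.
        x = rng.uniform(-3, 3, size=n)
        p = 1 / (1 + np.exp(-(1 + x - 0.75 * x**2)))   # true P(Y = 1 | X = x)
        return x, rng.binomial(1, p)

    x_tr, y_tr = simulate(40_000)    # to fit the (misspecified) model
    x_cal, y_cal = simulate(10_000)  # held-out data for recalibration
    x_te, y_te = simulate(10_000)    # fresh data to check calibration

    # Misspecified model: logistic regression with a linear term only.
    model = LogisticRegression().fit(x_tr.reshape(-1, 1), y_tr)

    def score(v):
        return model.predict_proba(v.reshape(-1, 1))[:, 1]

    # Histogram binning, fit on the held-out calibration set: each raw score
    # is replaced by the empirical event rate of its score bin.
    edges = np.quantile(score(x_cal), np.linspace(0, 1, 11))[1:-1]
    cal_bins = np.digitize(score(x_cal), edges)
    bin_rate = np.array([y_cal[cal_bins == b].mean() for b in range(10)])

    def recalibrate(v):
        return bin_rate[np.digitize(score(v), edges)]

    # Reliability check on the test set: within bins of the raw score, compare
    # the raw score, the binned estimate, and the empirical event rate.
    test_bins = np.digitize(score(x_te), edges)
    print("raw score | recalibrated | empirical rate")
    for b in range(10):
        m = test_bins == b
        print(f"  {score(x_te)[m].mean():.2f}    |    {recalibrate(x_te)[m].mean():.2f}      |     {y_te[m].mean():.2f}")

By construction the binned estimate matches the held-out event rate within each score bin, so it stays close to the empirical rate on fresh data even though the underlying logistic fit remains misspecified and its raw scores do not.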

I am actually unsure why a random forest with class probabilities is also considered to make distributional assumptions; maybe that has to do with their theorems and the random forest violating them. To understand this I would have to dig deeper into the article. – Sextus Empiricus Mar 18 '24 at 08:18