A NeurIPS 2020 paper by Gupta, Podkopaev & Ramdas addresses the calibration of outputs from binary “classification” models, noting that the raw scores, even if they lie in $\left[0, 1\right]$, need not have literal interpretations as probabilities until they have been assessed for calibration and, if necessary, adjusted (e.g., by Platt scaling) to reflect the actual frequency of event occurrence.
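To make concrete what I mean by an adjustment like Platt scaling: a minimal sketch of fitting a sigmoid map from raw scores to recalibrated probabilities. This is my own toy implementation (function names and the gradient-descent fit are mine, not from the paper):

```python
import math
import random

def platt_scale(scores, labels, lr=0.1, steps=2000):
    """Fit p(y=1 | s) = sigmoid(a*s + b) by gradient descent on log loss.

    This is the usual Platt-scaling form: a monotone sigmoid remapping
    of the raw score, with two free parameters (a, b).
    """
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        ga = gb = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            ga += (p - y) * s / n   # gradient of log loss w.r.t. a
            gb += (p - y) / n       # gradient of log loss w.r.t. b
        a -= lr * ga
        b -= lr * gb
    return a, b

# Toy data: raw scores whose true event rate is 0.25 + 0.5*s,
# so the raw score is NOT itself a calibrated probability.
random.seed(0)
scores = [random.random() for _ in range(500)]
labels = [1 if random.random() < 0.25 + 0.5 * s else 0 for s in scores]

a, b = platt_scale(scores, labels)
calibrated = lambda s: 1.0 / (1.0 + math.exp(-(a * s + b)))
```

After the fit, `calibrated(s)` is intended as an estimate of the event frequency among cases scoring near $s$, which is the sense of “probability” I take the paper to care about.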
Throughout the paper, there are references to distributional assumptions and to how various calibration techniques are sensitive to them. As far as I can tell, in a binary “classification” model the outcome is conditionally Bernoulli, end of discussion, so there do not seem to be any distributional assumptions to make. I would be on board with multiple possible distributional assumptions for an outcome more complex than binary (a continuous outcome, for instance, could be conditionally Gaussian, conditionally t-distributed, etc.). For a binary outcome, though, that seems bizarre. The outcome is conditionally Bernoulli, and that’s that.
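To illustrate my reading: under the conditionally-Bernoulli view, the only thing calibration can mean is that, among cases scoring near $s$, the empirical event frequency matches $s$. A minimal binning check of that idea (my own sketch, not the paper's procedure):

```python
import random

def reliability_bins(scores, labels, n_bins=5):
    """Bin examples by score; within each bin the outcome is Bernoulli
    with some rate, so the empirical frequency estimates that rate.

    Returns (mean_score, empirical_frequency, count) per nonempty bin.
    A calibrated model has mean_score ~= empirical_frequency in each bin.
    """
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # clamp s == 1.0 into last bin
        bins[idx].append((s, y))
    out = []
    for b in bins:
        if b:
            mean_score = sum(s for s, _ in b) / len(b)
            freq = sum(y for _, y in b) / len(b)
            out.append((mean_score, freq, len(b)))
    return out

# Simulated perfectly calibrated scores: y ~ Bernoulli(s).
random.seed(1)
scores = [random.random() for _ in range(5000)]
labels = [1 if random.random() < s else 0 for s in scores]
bins = reliability_bins(scores, labels)
```

Under this view the per-bin frequencies are the whole story, which is why I don't see where a further distributional assumption would enter.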
What do Gupta, Podkopaev & Ramdas see differently? What distributional assumptions do they think could be made with a binary outcome?
