
The objective of supervised learning is to induce a function $f_\theta$ from a family of functions $F$ (i.e., $f_\theta \in F$), using a training set $D^{tr}=\{(x_0^{tr},y_0^{tr}), \ldots, (x_n^{tr},y_n^{tr})\} \subseteq \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ and $\mathcal{Y}$ denote the domains of the predictors $X$ and the target $Y$, respectively.

Now I want to state a theorem saying that the predictive performance of the model $f_\theta$ is that of a random classifier, regardless of which $f_\theta$ is used; in other words, there is no predictive performance at all.

I believe using $AUC=0.5$ could be an option, but is there a more elegant way to state this?

Edit: I am referring specifically to binary classification. My question is about the mathematical formulation.
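
For concreteness, one candidate statement I can think of (probably not the most elegant one, which is why I am asking) assumes the predictors carry no information about the target, i.e. $X \perp Y$, and then claims that for every $f_\theta \in F$, using $f_\theta(X)$ as a score,

$$\operatorname{AUC}(f_\theta) = P\big(f_\theta(X_1) > f_\theta(X_0)\big) + \tfrac{1}{2}\, P\big(f_\theta(X_1) = f_\theta(X_0)\big) = \tfrac{1}{2},$$

where $X_1 \sim P(X \mid Y=1)$ and $X_0 \sim P(X \mid Y=0)$ are drawn independently.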

  • $AUC=0.5$ implies a specific type of problem where the outcome $y$ is a binary category. Perhaps this is what you want, but there is more to supervised learning than just these problems. Could you please clarify your question? – Dave Sep 26 '22 at 02:34
  • A theoretical notion is infinite loss. Any model that predicts $\hat y = \pm \infty \in \bar{\mathbb{R}}$ gives an $R^2 = -\infty$. Not a useful benchmark for modelling, however. Still, it is hard to be more wrong than infinitely-wrong! :P – Galen Sep 26 '22 at 03:43
  • Your question seems to focus on binary classification, so assuming a balanced training set, a 50:50-and-pick-em result is as unhelpful as a binary classifier gets. But that isn't 0.5 AUC. That is the central point (0.5, 0.5) for sensitivity vs. 1-specificity. (The AUC curve implies an adjustable hyperparameter, like a classification threshold, that you haven't described.) Also, you mention the family $f_\theta$. Do you seek a worthless optimal $f_{\hat{\theta}}$, specifically with a mind to emphasize the uselessness of the entire family? – Peter Leopold Sep 26 '22 at 03:44
  • You need to understand that there is a better default classifier than a random classifier, which is to emit the majority class all the time (the class with higher prior probability). – Cagdas Ozgenc Sep 26 '22 at 04:59
  • @CagdasOzgenc Only if you’re predicting categorical outcomes, and if a random classifier uses the prior probability to assign classes randomly (not just using $0.5$), I think that would give the same performance (probably with some caveat about “in expected value”). – Dave Sep 26 '22 at 12:12
  • Edit: my problem is restricted to binary classification. – Carlos Mougan Sep 26 '22 at 12:32
  • Regardless of the ratio of the classes, I understand that if you always predict the majority class, AUC $=0.5$. – Carlos Mougan Sep 26 '22 at 12:33
  • @PeterLeopold the family of models does not matter; $f_\theta$ could even be the Bayes optimal classifier, which I understand is the best possible classifier. – Carlos Mougan Sep 26 '22 at 12:34
  • I want to express formally that, regardless of the model, there is no information that allows one to distinguish between the classes $Y$. – Carlos Mougan Sep 26 '22 at 12:35

1 Answer


This is complicated, since supervised learning comes in so many flavors, but a few general principles can help you solve special cases as they arise.

The first important topic to consider is what a supervised learning model is meant to predict. Many options are possible, but let’s stick with estimating the conditional mean, as this is the most common type of supervised learning task (whether classification or regression).

The second important topic to consider is your assessment of model quality. In simple cases, we can look at plots and tell when a fit is decent or trash.

[Scatter plot of y against x with the red, black, and blue lines drawn by the code below]

set.seed(2022)
N <- 100
x <- runif(N)
y <- x + rnorm(N, 0, 0.1)             # true relationship: y = x plus a little noise
plot(x, y)
abline(1, -1, col = "red")            # deliberately bad line: intercept 1, slope -1
abline(0, 1, col = "black")           # the true line y = x
abline(lm(y ~ x)$coef, col = "blue")  # ordinary least squares fit

For instance, we know that the blue and black lines are much better fits to the data than the red line. However, if you don’t look at the code, can you eyeball whether the black line is a better fit than the blue line?

Consequently, it is desirable to quantify the quality of the fit, and typical ways of doing this involve calculating some kind of deviation between the observed values and the predicted values. Such measures are the so-called loss functions: they measure some aspect of the quality of your model and how painful its mistakes are.
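
As a concrete illustration (continuing the simulated example above; the mse helper is mine, not part of the original code), squared-error loss ranks the three lines numerically:

mse <- function(y, y_hat) mean((y - y_hat)^2)   # mean squared error

mse(y, 1 - 1 * x)          # red line:   intercept 1, slope -1 (the bad fit)
mse(y, 0 + 1 * x)          # black line: the true relationship y = x
mse(y, fitted(lm(y ~ x)))  # blue line:  the OLS fit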

At the same time, nothing says that you have to do any modeling. If you want to know something about the conditional mean of your $y$, there is an argument (I think a strong one) that the best naïve guess would be to predict the overall mean every time.

Since this is a naïve guess that requires no fancy modeling or machine learning, if your fancy modeling cannot do better than that, then your fancy modeling is ineffective. Put bluntly, why would someone pay lots of money to a data scientist when she could do just as well by predicting the same easy-to-calculate number every time?

Consequently, compare the performance of your model to the baseline model that naïvely guesses the same value every time. You do this by comparing the loss incurred by your model to the loss incurred by that naïve model. If your model is complicated, you might want to penalize the model loss or test on out-of-sample data (though this really warrants a separate question).
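
Sticking with the simulated example, a minimal sketch of that comparison (using the mse helper from the snippet above):

loss_model    <- mse(y, fitted(lm(y ~ x)))            # your model's loss
loss_baseline <- mse(y, rep(mean(y), length(y)))      # naive always-predict-the-mean loss

loss_model / loss_baseline       # close to 1 would mean the model adds little
1 - loss_model / loss_baseline   # for OLS this is exactly the in-sample R^2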

If you can’t outperform the naïve guessing (and do so routinely), then I would say it is fair to consider your model to have no predictive ability.

This is exactly what $R^2$ and $R^2$-style metrics (like McFadden’s pseudo-$R^2$) do. I have some strong opinions about $R^2$-style metrics that I discuss here, here, and here.

If, instead of the conditional mean, you want to predict the conditional median, compare your model's performance to that of a model that naïvely guesses the overall median every time. If you want to predict some other conditional quantile, compare to a model that naïvely guesses that overall quantile every time. If you want to predict the probability of belonging to either of two classes (e.g., dog picture or cat picture), compare to a model that always predicts the class prevalence (this is actually the mean if you encode the classes as $0$ and $1$). If you want to predict the probability of belonging to each of multiple classes (e.g., MNIST handwritten digit recognition), compare to a model that always predicts the relative frequencies of the various classes.
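
For the binary case the question asks about, here is a minimal sketch of the same idea using log loss; the simulated data (x1, y1) and the log_loss helper are my own illustration, not part of the original analysis:

# Compare a model's log loss to a baseline that always predicts the class
# prevalence (the mean of a 0/1-encoded outcome)
set.seed(2022)
n  <- 500
x1 <- rnorm(n)
y1 <- rbinom(n, 1, plogis(2 * x1))   # outcome genuinely depends on x1 here

log_loss <- function(y, p) -mean(y * log(p) + (1 - y) * log(1 - p))

p_model    <- fitted(glm(y1 ~ x1, family = binomial))
p_baseline <- rep(mean(y1), n)

log_loss(y1, p_model)     # noticeably lower than the baseline in this simulation
log_loss(y1, p_baseline)  # fail to beat this and the model has no predictive value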

If you can’t beat the naïve model, you lack predictive power.

Do note that $R^2$-style metrics have their own issues. As this excellent answer (not by me) discusses, $R^2$ can miss aspects of the modeling. However, a comparison of your model to some kind of baseline always makes sense.

A common complaint about $R^2$-style metrics is that they can be driven to perfection by overfitting. This is true, and it is why you use appropriate penalties for model complexity or out-of-sample testing.
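
Continuing the binary sketch above, the out-of-sample version of that comparison might look like this (the 70/30 split is an arbitrary choice of mine):

# Same comparison, evaluated on held-out data to guard against overfitting
train <- sample(seq_len(n), size = 0.7 * n)
dat   <- data.frame(y1, x1)

fit    <- glm(y1 ~ x1, family = binomial, data = dat[train, ])
p_test <- predict(fit, newdata = dat[-train, ], type = "response")

log_loss(y1[-train], p_test)                                    # model, out of sample
log_loss(y1[-train], rep(mean(y1[train]), length(y1[-train])))  # training-prevalence baseline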

– Dave