In case of no correlation, can a model make predictions above the expected values?

Question

For simplicity's sake, let's suppose a binary classification problem in which each class has a probability of exactly 50%, and a scikit-learn SVC model.

Let's assume that the classes in the vector Y cannot be predicted from the values in the matrix X, because they are unrelated, but we don't know that. So we continue collecting observations, hoping that good predictions will become possible at some point.

As the number of observations tends to infinity, should the class probabilities (obtained from predict_proba) converge towards that 50%?

If the probabilities don't converge, does that mean there may be some form of correlation?

Any thoughts are much appreciated. Thanks in advance.

Are the classes balanced 50:50 in the sampling distribution? — Galen, Sep 07 '23 at 17:39
@Galen Let's say they are around 50:50, ie. they will balance as the sample grows. — Juan Flautista De Torrepacheco, Sep 07 '23 at 17:43
It is worth noting that "unrelated" and "uncorrelated" are not the same. There can be a relationship without that relationship being correlation. — Dave, Sep 07 '23 at 18:11
"Unrelated" seems to be an elliptical way of telling us $X$ is irrelevant, so let's ignore it. Your question therefore concerns what happens in the long run when you predict the results of random flips of a fair coin. The answer is given by various laws of large numbers as well as the Central Limit Theorem. — whuber, Sep 07 '23 at 18:29
@whuber That's a good point! But remember that, at the beginning, you don't know whether X is relevant or not. So how can you tell? What would predict_proba look like if it's not? — Juan Flautista De Torrepacheco, Sep 07 '23 at 18:35
It doesn't matter what you know. All that matters is how $Y$ behaves. If you employ useless information, such as $X,$ then you might make poor predictions. But when $Y$ is independent of all your information and has equal chances of being in one of two states, it's a flip of a fair coin and so is the agreement between your prediction and each outcome of $Y.$ — whuber, Sep 07 '23 at 18:38
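
The coin-flip argument in the comments above can be checked with a short simulation. This is only a sketch: the predictor's 70% bias, the function name `agreement_rate`, the seed, and the sample sizes are all made up for illustration. The only property that matters is that the guesses are generated without looking at the outcomes.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for reproducibility


def agreement_rate(n):
    """Fraction of n fair-coin outcomes matched by guesses made independently of them."""
    hits = 0
    for _ in range(n):
        outcome = random.random() < 0.5  # fair coin
        guess = random.random() < 0.7    # hypothetical predictor that ignores the outcome
        hits += (outcome == guess)
    return hits / n


results = []
for n in (100, 10_000, 100_000):
    rate = agreement_rate(n)
    std_err = 0.5 / math.sqrt(n)  # standard error of a proportion when p = 0.5
    results.append((n, rate))
    print(f"n={n:>7}  agreement={rate:.4f}  standard error={std_err:.4f}")
```

Whatever bias the predictor has, each guess agrees with a fair coin with probability 0.5, so the agreement rate settles around 0.5 and its spread shrinks like $0.5/\sqrt{n}$.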

Dave · Accepted Answer · 2023-09-07T17:52:22.347

You can force an outstanding in-sample accuracy by overfitting to the training data. However, if the features are unrelated to the outcomes, that means the feature distributions are identical for both classes (otherwise, they would be related to the outcomes).

If the feature distributions are identical for the two categories, then you have no hope of reliably distinguishing between the two categories. Given a feature vector, you never have any insight into the category to which it belongs, beyond the fact that one category might generally be more common (which you've excluded as a possibility by saying the categories are represented equally).
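
To illustrate the gap between in-sample and out-of-sample performance, here is a minimal sketch using scikit-learn's `SVC` on pure noise. The sample sizes, `gamma`, `C`, and the seed are arbitrary choices made for the demonstration, not recommendations. Note that `SVC` only provides `predict_proba` when it is constructed with `probability=True`, in which case scikit-learn calibrates the probabilities with Platt scaling fitted by internal cross-validation.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility


def make_noise(n, p=5):
    """Draw features and labels independently, so X carries no information about y."""
    X = rng.normal(size=(n, p))
    y = rng.integers(0, 2, size=n)
    return X, y


X_train, y_train = make_noise(400)
X_test, y_test = make_noise(2000)

# A large gamma and a large C let the RBF kernel memorize individual training points.
model = SVC(kernel="rbf", gamma=10.0, C=100.0, probability=True, random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on the data used for fitting
test_acc = model.score(X_test, y_test)     # accuracy on fresh draws from the same process
mean_proba = model.predict_proba(X_test)[:, 1].mean()

print(f"in-sample accuracy:      {train_acc:.3f}")
print(f"out-of-sample accuracy:  {test_acc:.3f}")
print(f"mean predicted P(Y=1) on new data: {mean_proba:.3f}")
```

The in-sample accuracy should come out far above 50% because the model has memorized the training labels, while the accuracy on fresh data should stay near chance. Only the second number says anything about whether X is related to Y.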

You can be a bit more formal about this by using Bayes' theorem. Your model is predicting $P(Y=1\vert X=x)$ for some feature vector $x$, that is, the probability that $Y$ belongs to category $1$ given that the features are the $x$ that they are.

$$ P(Y=1\vert X=x) = \dfrac{ P(X = x\vert Y = 1)P(Y = 1) }{ P(X = x) } $$

When the features have identical distributions for categories $0$ and $1$, then $P(X = x\vert Y = 1) = P(X = x)$, so that factor cancels against the denominator and your predicted probability is just the overall probability of category $1$, $P(Y = 1)$, which is $0.5$ in the balanced case. For every $x$, the observation is equally likely to belong to either category.
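
For continuous features, where $P(X = x)$ is zero for every $x$, the same argument can be sketched with densities. Assume (this notation is introduced here for illustration) that $X$ has class-conditional densities $f_0$ and $f_1$, and write $\pi = P(Y = 1)$. Then, wherever the denominator is positive,

$$ P(Y=1\vert X=x) = \dfrac{ f_1(x)\,\pi }{ f_1(x)\,\pi + f_0(x)\,(1-\pi) } $$

If the features have the same distribution in both categories, so that $f_0 = f_1$, the common density cancels and

$$ P(Y=1\vert X=x) = \dfrac{ \pi }{ \pi + (1-\pi) } = \pi $$

which is again $0.5$ in the balanced case, regardless of $x$.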

edited Sep 07 '23 at 17:52

answered Sep 07 '23 at 17:50

Dave

Somehow I never get bored of Bayes' rule. (+1) – Galen Sep 07 '23 at 17:52
Could you please add some information about how predict_proba works? Then I'll mark your answer as accepted. Thanks! – Juan Flautista De Torrepacheco Sep 07 '23 at 18:15
@JuanFlautistaDeTorrepacheco The predict_proba method returns probability predictions instead of hard categories, but you seem to know that. What else would you want to see about predict_proba? – Dave Sep 07 '23 at 18:16
@Dave Regarding the second question: if a vector were predicted at, e.g., more than 55:45, would that prove that there is some correlation? Or could it be an artifact of predict_proba? – Juan Flautista De Torrepacheco Sep 07 '23 at 18:22
I do not follow the question. Could you please clarify? @JuanFlautistaDeTorrepacheco – Dave Sep 07 '23 at 18:23
@Dave In other words, let's say we suspect that some coins are rigged, i.e., either heads or tails will come up more than 50% of the time. But the coins are actually fair. (Now replace coins with "vectors to be predicted by the model") – Juan Flautista De Torrepacheco Sep 07 '23 at 18:30
@JuanFlautistaDeTorrepacheco What does that have to do with predict_proba? – Dave Sep 07 '23 at 18:49
@Dave You could talk about, for example, how the margins applied in an SVC (soft, hard, etc.) get broader and broader as more observations are added, resulting in the convergence of the probabilities towards 50:50. – Juan Flautista De Torrepacheco Sep 08 '23 at 10:07
@JuanFlautistaDeTorrepacheco The beauty of this is that it doesn’t depend on the model. What I wrote is true whether you use an SVM approach, neural networks, logistic regression, or anything else. // However, nothing guarantees that the in-sample predictions are going to score a particular way. If you’re not careful, you will wind up overfitting your model to the training data. – Dave Sep 08 '23 at 10:15
@Galen There’s probably a way to be more formal with the representation of Bayes’ theorem by considering the continuous case. If someone has thoughts on how to do that, I’d be eager to read it! – Dave Sep 08 '23 at 10:24
