I have read a number of questions whose crux is a lamentation that a rare outcome cannot be predicted by a regression model of some kind. While I understand the desire to predict such events reliably (imagine being able to predict major financial crashes) and the disappointment of failing to do so, I struggle to understand why this is anything other than expected behavior.
As an example, let's consider a "classification" problem with two outcome categories, coded as $0$ and $1$. Modeling aims to predict, for a feature vector $x$, $P(Y=1\vert X=x)$, which can be rewritten using Bayes' theorem.
$$ P(Y=1\vert X=x) = \dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }\times P(Y = 1) $$
If category $1$ is unusual, then $P(Y = 1)$ is low, and Bayes' theorem says that $P(Y=1\vert X=x)$ is low as well unless $\dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }$ is high.
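For a sense of scale (the numbers here are invented purely for illustration): with a base rate of $P(Y = 1) = 0.01$, even a likelihood ratio of $10$ only lifts the posterior to $0.1$, and the ratio has to reach $50$ before the model can report a coin-flip probability of $0.5$.

$$ P(Y=1\vert X=x) = \underbrace{\dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }}_{10} \times \underbrace{P(Y = 1)}_{0.01} = 0.1 $$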
(There is an analogous expression for continuous distributions, for which the probabilities $P(X = x)$ are zero, and measure theory lets us generalize without regard to whether the distributions are continuous, discrete, or neither.)
Now, $\dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }$ measures how much the probability of $X=x$ within group $1$ differs from the overall probability of $X=x$: that is, how much such an $x$ sticks out in the feature space, as opposed to being business as usual with nothing to see. If this $x$ is rare overall (low denominator) but common in group $1$ (high numerator), then the fraction is large and can overwhelm a low $P(Y = 1)$ even when $Y=1$ is a rare outcome. If such an $x$ is generally common, however, the denominator is large, so even if the $x$ is common in group $1$, this is not news; "business as usual" applies.
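A quick simulation makes the two regimes concrete. The probabilities below (a $1\%$ base rate, $P(X = 1\vert Y = 1) = 0.9$, and so on) are invented purely for illustration, and the snippet just estimates the ratio and the posterior empirically rather than fitting any model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Rare outcome: P(Y = 1) = 0.01
y = rng.random(n) < 0.01

# Scenario A: x is rare overall but common in group 1
#   P(X = 1 | Y = 1) = 0.9,  P(X = 1 | Y = 0) = 0.01
x_a = np.where(y, rng.random(n) < 0.9, rng.random(n) < 0.01)

# Scenario B: x is "business as usual", common in both groups
#   P(X = 1 | Y = 1) = 0.9,  P(X = 1 | Y = 0) = 0.9
x_b = np.where(y, rng.random(n) < 0.9, rng.random(n) < 0.9)

for name, x in [("rare-but-diagnostic x", x_a), ("business-as-usual x", x_b)]:
    ratio = x[y].mean() / x.mean()   # estimates P(X=1|Y=1) / P(X=1)
    posterior = y[x].mean()          # estimates P(Y=1|X=1)
    print(f"{name}: ratio ~ {ratio:.1f}, P(Y=1|X=1) ~ {posterior:.3f}")
```

In the first scenario the empirical $P(Y=1\vert X=1)$ comes out close to $0.5$ despite the $1\%$ base rate, because the ratio is roughly $50$; in the second it stays at the base rate.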
Therefore, when I see questions like this and this posted by someone who believes the event can be reliably predicted, my mind goes to two possibilities.
1. You are missing an important feature that would help separate rare events from the majority of the data (maybe those events aren't so rare once you condition on some missing feature).
2. The important features are in the model but are being used improperly, such as failing to model an interaction like the one that matters in the image below, taken from my answer here. (A small simulated illustration of both possibilities follows this list.)
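Here is a minimal sketch of possibility 2 (and, if you treat the product $x_1 x_2$ as the "missing feature", of possibility 1 as well). The data-generating process and coefficient values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# The rare event depends on the *product* of the features (an interaction):
# it is likely only when x1 and x2 are both large and of the same sign.
p_true = 1 / (1 + np.exp(-(-9 + 3 * x1 * x2)))
y = rng.random(n) < p_true
print("event rate:", y.mean())   # the event is rare in this setup

X_main = np.column_stack([x1, x2])            # main effects only
X_int  = np.column_stack([x1, x2, x1 * x2])   # main effects plus interaction

for name, X in [("main effects only", X_main), ("with interaction", X_int)]:
    fit = LogisticRegression(max_iter=1000).fit(X, y)
    p_hat = fit.predict_proba(X)[:, 1]
    print(f"{name}: max predicted P(Y=1) ~ {p_hat.max():.2f}, "
          f"mean prediction on actual events ~ {p_hat[y].mean():.2f}")
```

With main effects only, the predictions never rise much above the base rate, because events occur both when $x_1$ and $x_2$ are both large and positive and when both are large and negative, so neither feature separates the events on its own; adding the interaction term lets the model push predictions for those observations toward $1$.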
Could there be a third possibility, perhaps something related to the loss function (a sketch of what I might mean is below)? To what extent does an inability to predict an apparent anomaly mean that we lack something in the feature space to distinguish it from business as usual? I do not want to be so focused on this facet that I lose sight of other possibilities.
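To make that question concrete, one off-the-shelf instance of "changing the loss function" is reweighting the log-loss so that errors on the rare class cost more, e.g. via `class_weight` in scikit-learn. The data below are simulated purely for illustration, and I am not suggesting this particular reweighting is the answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data, invented purely for illustration: a roughly 1% event driven by x0.
rng = np.random.default_rng(2)
n = 50_000
X = rng.normal(size=(n, 2))
y = rng.random(n) < 1 / (1 + np.exp(-(-5 + 1.5 * X[:, 0])))

# "Changing the loss function" here means multiplying each observation's
# contribution to the log-loss by a class weight, so misclassifying the rare
# class costs more during fitting.
plain    = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

for name, fit in [("unweighted log-loss", plain), ("class-weighted log-loss", weighted)]:
    p_hat = fit.predict_proba(X)[:, 1]
    print(f"{name}: mean predicted P(Y=1) on actual events ~ {p_hat[y].mean():.2f}")
```

The reweighted fit does produce much higher predictions on the rare class, but it does so by shifting the fitted probabilities away from calibrated estimates of $P(Y=1\vert X=x)$ rather than by extracting new information from the features, which is why I am unsure whether this counts as a genuine third possibility.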
POSSIBLY RELATED LINKS
This started out sounding like it was insistent on predicting unusual values but has taken a turn toward what I am wondering here: whether some kind of loss function can better guide the model toward these extreme predictions without totally sacrificing the ability to predict the more mainstream values. (Perhaps this means that some loss function might do an especially good job of estimating $\frac{ P(X = x\vert Y = 1) }{ P(X = x) }$.)
