I have read a number of questions whose crux is a lamentation that a rare outcome cannot be predicted by a regression model of some kind. While I understand the desire to predict such events reliably (imagine being able to predict major financial crashes) and the disappointment of failing to do so, I struggle to understand why this is anything other than expected behavior.
As an example, let's consider a "classification" problem with two outcome categories, coded as $0$ and $1$. Modeling aims to predict, for a feature vector $x$, $P(Y=1\vert X=x)$, which can be rewritten using Bayes' theorem.
$$ P(Y=1\vert X=x) = \dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }\times P(Y = 1) $$
If category $1$ is unusual, then $P(Y = 1)$ is low, and Bayes' theorem says that $P(Y=1\vert X=x)$ is low as well unless $\dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }$ is high.
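For a sense of scale (the numbers here are invented purely for illustration): with a base rate of $P(Y = 1) = 0.01$, even a likelihood ratio of $10$ only lifts the posterior to $0.1$, and the ratio has to reach $50$ before the model can report a coin-flip probability of $0.5$.

$$ P(Y=1\vert X=x) = \underbrace{\dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }}_{10} \times \underbrace{P(Y = 1)}_{0.01} = 0.1 $$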
(There is an analogous expression for continuous distributions, for which the probabilities $P(X = x)$ are zero, and measure theory lets us generalize without regard to whether the distributions are continuous, discrete, or neither.)
Now, $\dfrac{ P(X = x\vert Y = 1) }{ P(X = x) }$ measures how much the probability of $X=x$ within group $1$ differs from the overall probability of $X=x$: that is, how much such an $x$ sticks out in the feature space, as opposed to being business as usual with nothing to see. If this $x$ is rare overall (low denominator) but common in group $1$ (high numerator), then the fraction is large and can overwhelm a low $P(Y = 1)$ even when $Y=1$ is a rare outcome. If such an $x$ is generally common, however, the denominator is large, so even if the $x$ is common in group $1$, this is not news; "business as usual" applies.
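A quick simulation makes the two regimes concrete. The probabilities below (a $1\%$ base rate, $P(X = 1\vert Y = 1) = 0.9$, and so on) are invented purely for illustration, and the snippet just estimates the ratio and the posterior empirically rather than fitting any model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Rare outcome: P(Y = 1) = 0.01
y = rng.random(n) < 0.01

# Scenario A: x is rare overall but common in group 1
#   P(X = 1 | Y = 1) = 0.9,  P(X = 1 | Y = 0) = 0.01
x_a = np.where(y, rng.random(n) < 0.9, rng.random(n) < 0.01)

# Scenario B: x is "business as usual", common in both groups
#   P(X = 1 | Y = 1) = 0.9,  P(X = 1 | Y = 0) = 0.9
x_b = np.where(y, rng.random(n) < 0.9, rng.random(n) < 0.9)

for name, x in [("rare-but-diagnostic x", x_a), ("business-as-usual x", x_b)]:
    ratio = x[y].mean() / x.mean()   # estimates P(X=1|Y=1) / P(X=1)
    posterior = y[x].mean()          # estimates P(Y=1|X=1)
    print(f"{name}: ratio ~ {ratio:.1f}, P(Y=1|X=1) ~ {posterior:.3f}")
```

In the first scenario the empirical $P(Y=1\vert X=1)$ comes out close to $0.5$ despite the $1\%$ base rate, because the ratio is roughly $50$; in the second it stays at the base rate.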
Therefore, when I see questions like this and this posted by someone who believes the event can be reliably predicted, my mind goes to two possibilities.
1. You are missing an important feature that would help separate rare events from the majority of the data (maybe those events aren't so rare once you condition on some missing feature).
2. The important features are in the model but are being used improperly, such as failing to model an interaction like the one that matters in the image below, taken from my answer here. (A small simulated illustration of both possibilities follows this list.)
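Here is a minimal sketch of possibility 2 (and, if you treat the product $x_1 x_2$ as the "missing feature", of possibility 1 as well). The data-generating process and coefficient values are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# The rare event depends on the *product* of the features (an interaction):
# it is likely only when x1 and x2 are both large and of the same sign.
p_true = 1 / (1 + np.exp(-(-9 + 3 * x1 * x2)))
y = rng.random(n) < p_true
print("event rate:", y.mean())   # the event is rare in this setup

X_main = np.column_stack([x1, x2])            # main effects only
X_int  = np.column_stack([x1, x2, x1 * x2])   # main effects plus interaction

for name, X in [("main effects only", X_main), ("with interaction", X_int)]:
    fit = LogisticRegression(max_iter=1000).fit(X, y)
    p_hat = fit.predict_proba(X)[:, 1]
    print(f"{name}: max predicted P(Y=1) ~ {p_hat.max():.2f}, "
          f"mean prediction on actual events ~ {p_hat[y].mean():.2f}")
```

With main effects only, the predictions never rise much above the base rate, because events occur both when $x_1$ and $x_2$ are both large and positive and when both are large and negative, so neither feature separates the events on its own; adding the interaction term lets the model push predictions for those observations toward $1$.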
Could there be a third possibility, perhaps something related to the loss function (a sketch of what I might mean is below)? To what extent does an inability to predict an apparent anomaly mean that we lack something in the feature space to distinguish it from business as usual? I do not want to be so focused on this facet that I lose sight of other possibilities.
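To make that question concrete, one off-the-shelf instance of "changing the loss function" is reweighting the log-loss so that errors on the rare class cost more, e.g. via `class_weight` in scikit-learn. The data below are simulated purely for illustration, and I am not suggesting this particular reweighting is the answer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data, invented purely for illustration: a roughly 1% event driven by x0.
rng = np.random.default_rng(2)
n = 50_000
X = rng.normal(size=(n, 2))
y = rng.random(n) < 1 / (1 + np.exp(-(-5 + 1.5 * X[:, 0])))

# "Changing the loss function" here means multiplying each observation's
# contribution to the log-loss by a class weight, so misclassifying the rare
# class costs more during fitting.
plain    = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

for name, fit in [("unweighted log-loss", plain), ("class-weighted log-loss", weighted)]:
    p_hat = fit.predict_proba(X)[:, 1]
    print(f"{name}: mean predicted P(Y=1) on actual events ~ {p_hat[y].mean():.2f}")
```

The reweighted fit does produce much higher predictions on the rare class, but it does so by shifting the fitted probabilities away from calibrated estimates of $P(Y=1\vert X=x)$ rather than by extracting new information from the features, which is why I am unsure whether this counts as a genuine third possibility.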
POSSIBLY RELATED LINKS
This started out sounding like it was insistent on predicting unusual values but has taken a turn toward what I am wondering here: whether some kind of loss function can better guide the model toward these extreme predictions without totally sacrificing the ability to predict the more mainstream values. (Perhaps this means that some loss function might do an especially good job of estimating $\frac{ P(X = x\vert Y = 1) }{ P(X = x) }$.)
