Most classification models are based on searching for the class that maximizes P(class | features). But why not the opposite, P(features | class)?
The second question is: are there any situations where the latter is more appropriate?
We do classify with P(features | class). Recall Bayes' rule:
$$ P(y \vert x ) \propto P(x \vert y) P(y) $$
with equality if we divide by a normalizing constant. P(features | class) is thus a component of P(class | features).
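For concreteness, here is a minimal sketch of classifying this way: score each class with $P(x \vert y) P(y)$ and take the argmax, ignoring the normalizing constant. The Gaussian model for $p(x \vert y)$ and the synthetic one-feature data are purely illustrative assumptions, not part of the question.

```python
# Minimal sketch of generative classification: pick the class y that
# maximizes p(x | y) * P(y), i.e. the unnormalized posterior from Bayes' rule.
# Class-conditional densities are modelled as Gaussians; data is synthetic.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two classes, one feature: class 0 centred at 0, class 1 centred at 3.
X = np.concatenate([rng.normal(0.0, 1.0, 100), rng.normal(3.0, 1.0, 50)])
y = np.concatenate([np.zeros(100, dtype=int), np.ones(50, dtype=int)])

# Estimate P(y) and the parameters of p(x | y) from the training data.
priors = np.array([np.mean(y == k) for k in (0, 1)])
means = np.array([X[y == k].mean() for k in (0, 1)])
stds = np.array([X[y == k].std() for k in (0, 1)])

def predict(x_new):
    # Score each class with p(x | y) * P(y); the normalizing constant P(x)
    # is the same for every class, so it can be ignored for the argmax.
    scores = [norm.pdf(x_new, means[k], stds[k]) * priors[k] for k in (0, 1)]
    return int(np.argmax(scores))

print(predict(0.5))  # likely class 0
print(predict(2.8))  # likely class 1
```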
The class is unknown. The features are known. It doesn’t make sense to use the unknown to predict the known. We neither have access to the unknown nor need to guess about the known.
For instance, it makes sense to use world events from this past weekend to predict if the stock market will go up or down tomorrow. It makes less sense to use tomorrow’s market movement to predict what happened over the weekend.
Also worth mentioning is that, by Bayes' theorem, P(features | class) is implicitly accounted for in P(class | features).
Interesting question: what actually is $P$(features $\mid$ class)? It is the conditional probability distribution of the feature vector, given a specific class outcome. Its expected value $E[P$(features $\mid$ class)$]$ is the average of this probability distribution.
The capital letter $P$ indicates that the distribution of the features is (multivariate) discrete. This means that each possible outcome of the feature vector has one probability associated with it. Also \begin{equation} \sum_{\vec{f} \; \in \; \text{feature space}} P(\vec{f} \mid class) \; = \;1 \end{equation} The more features in the feature space, the smaller the probabilities $P$(features $\mid$ class) will generally be.
In essence, the distribution of $P$(features $\mid$ class) is interesting to study as it represents the typicality of the various feature outcomes, for the given class. It therefore provides insight into the objects/cases that belong to this specific class. This is where $P$(features $\mid$ class) is of interest.
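As a rough illustration, the sketch below estimates a discrete $P$(features $\mid$ class) from hypothetical observations of a made-up two-feature example (the feature names and counts are invented for illustration only), checks that the probabilities sum to one over the feature space, and ranks the most typical feature outcomes for the class.

```python
# Illustrative sketch (made-up data) of a discrete class-conditional
# distribution P(feature vector | class): it sums to 1 over the feature
# space, and its largest values mark the most typical outcomes for the class.
from collections import Counter
from itertools import product

# Hypothetical observations of (weather, day_type) for objects in one class.
observations = [
    ("sunny", "weekend"), ("sunny", "weekend"), ("sunny", "weekday"),
    ("rainy", "weekday"), ("sunny", "weekend"), ("rainy", "weekend"),
]

counts = Counter(observations)
total = sum(counts.values())

feature_space = list(product(["sunny", "rainy"], ["weekday", "weekend"]))
p_given_class = {f: counts.get(f, 0) / total for f in feature_space}

for f, p in sorted(p_given_class.items(), key=lambda kv: -kv[1]):
    print(f, round(p, 3))            # most typical outcomes first

print(sum(p_given_class.values()))   # 1.0: sums to one over the feature space
```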
A final note relates to continuous features. These are real-valued measurements, so the Bayes classification rule instead makes use of the density p(features | class). Density values are only meaningful when compared relative to each other, or when they are integrated over an interval. So for continuous features, p(features | class) is generally not studied on its own.
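A brief sketch of that last point, using an assumed narrow Gaussian as the class-conditional density: a single density value can exceed 1 and is not a probability, whereas integrating over an interval, or comparing densities across classes, gives meaningful quantities.

```python
# Density values p(x | class) are only meaningful relative to each other or
# when integrated: a density can exceed 1 at a point, while its integral
# over an interval is a probability. The Gaussians here are illustrative.
from scipy.stats import norm
from scipy.integrate import quad

# Hypothetical class-conditional density: a narrow Gaussian.
density = norm(loc=0.0, scale=0.1)

print(density.pdf(0.0))                   # ~3.99, a density value > 1
prob, _ = quad(density.pdf, -0.05, 0.05)  # integrating gives a probability
print(prob)                               # ~0.38

# In Bayes' rule, only the ratio of class-conditional densities matters:
other = norm(loc=1.0, scale=0.1)
print(density.pdf(0.2) / other.pdf(0.2))  # relative comparison between classes
```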