2

Most classification models are based on searching for the class that maximizes P(class | features). But why not the opposite, P(features | class)?

The second question is: are there any situations where the latter is more appropriate?

  • Have a look at the Bayes Decision Rule. The rule describes exactly what you wrote here: that we should choose the class that maximizes P(class | features). Why use the Bayes decision rule? Because it minimizes the probability of error. – mhdadk Aug 07 '22 at 14:44
  • See also https://stats.stackexchange.com/q/118696/296197 – mhdadk Aug 07 '22 at 14:45
  • It's because in the canonical regression problem, we have the features and want to figure out what the class is, not the other way around! – John Madden Aug 07 '22 at 15:13
  • Why are you asking? Could you tell us more about how you would like to use it? As the answers below say, the obvious answer is that if you want to predict an unknown class from the features, then P(features | class) is not really useful. – Tim Aug 08 '22 at 07:54
  • We actually solve the latter problem exactly in order to convert it into the former one (LDA is an example). – ttnphns Aug 08 '22 at 10:08

3 Answers

1

We do classify with P(features | class). Recall Bayes' rule:

$$ P(y \vert x ) \propto P(x \vert y) P(y) $$

with equality once we divide by the normalizing constant. So P(features | class) is already a part of P(class | features).
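A minimal numerical sketch of this, with made-up priors and class-conditional probabilities (the class names and all numbers below are purely hypothetical, just to show the mechanics):

```python
# Hypothetical two-class example: P(y | x) is proportional to P(x | y) * P(y).
# The class names and all probabilities below are made up for illustration.

prior = {"spam": 0.3, "ham": 0.7}          # P(y)
likelihood = {"spam": 0.02, "ham": 0.001}  # P(x | y) for one observed feature vector x

# Unnormalized posterior: P(x | y) * P(y)
unnormalized = {y: likelihood[y] * prior[y] for y in prior}

# Normalizing constant: P(x) = sum over classes of P(x | y) * P(y)
evidence = sum(unnormalized.values())

# Posterior P(y | x): divide by the normalizing constant
posterior = {y: unnormalized[y] / evidence for y in prior}

print(posterior)                          # {'spam': 0.896..., 'ham': 0.104...}
print(max(posterior, key=posterior.get))  # 'spam' -- the Bayes decision
```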

0

The class is unknown. The features are known. It doesn’t make sense to use the unknown to predict the known. We neither have access to the unknown nor need to guess about the known.

For instance, it makes sense to use world events from this past weekend to predict if the stock market will go up or down tomorrow. It makes less sense to use tomorrow’s market movement to predict what happened over the weekend.

Also worth mentioning: by Bayes' theorem, P(features | class) is implicitly taken into account by P(class | features).

Dave
0

Interesting question: what actually is $P$(features $\mid$ class)? It is the conditional probability distribution of the feature vector, given a specific class outcome. Its mean, $E[\text{features} \mid \text{class}]$, is the average feature vector for that class.

The capital letter $P$ indicates that the distribution of the features is (multivariate) discrete. This means that each possible outcome of the feature vector has one probability associated with it, and \begin{equation} \sum_{\vec{f} \,\in\, \text{feature space}} P(\vec{f} \mid \text{class}) = 1. \end{equation} The more features in the feature space, the smaller the probabilities $P$(features $\mid$ class) will generally be.

In essence, the distribution $P$(features $\mid$ class) is interesting to study because it represents how typical the various feature outcomes are for the given class. It therefore provides insight into the objects/cases that belong to that specific class, and this is where $P$(features $\mid$ class) is of interest.
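As a concrete illustration, here is a small sketch that estimates the empirical class-conditional distribution from a made-up categorical dataset (the fruit data and feature names are hypothetical) and inspects which feature outcomes are typical for one class:

```python
from collections import Counter

# Made-up labelled data: (feature vector, class). The features are categorical,
# so P(features | class) is a genuine probability mass function.
data = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("green", "round"), "apple"),
    (("yellow", "long"), "banana"),
    (("yellow", "long"), "banana"),
    (("green", "long"), "banana"),
]

def conditional_distribution(data, cls):
    """Empirical P(feature vector | class), as a dict mapping feature vectors to probabilities."""
    in_class = [x for x, y in data if y == cls]
    counts = Counter(in_class)
    return {x: c / len(in_class) for x, c in counts.items()}

p_given_apple = conditional_distribution(data, "apple")
print(p_given_apple)                              # ≈ {('red', 'round'): 0.67, ('green', 'round'): 0.33}
print(sum(p_given_apple.values()))                # 1.0 -- sums to one over the observed feature space
print(max(p_given_apple, key=p_given_apple.get))  # ('red', 'round') -- the most typical outcome for 'apple'
```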

A final note relates to continuous features. These are real-number measurements, so the Bayes classification rule instead makes use of the density $p$(features $\mid$ class). Density values are only meaningful relative to each other, or when integrated over an interval, so for continuous features $p$(features $\mid$ class) is generally not studied on its own.
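For completeness, a small sketch (with made-up Gaussian class-conditional densities and priors) of how density values enter the classification rule only through comparison between classes:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, std):
    """Density of a normal distribution: p(feature | class) for a continuous feature."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

# Hypothetical one-dimensional feature with Gaussian class-conditional densities.
classes = {
    "class_a": {"mean": 0.0, "std": 1.0, "prior": 0.5},
    "class_b": {"mean": 2.0, "std": 1.0, "prior": 0.5},
}

x = 1.5  # observed feature value

# A density value in isolation is not a probability; only the comparison between
# classes (weighted by the priors) matters for the decision.
scores = {c: normal_pdf(x, p["mean"], p["std"]) * p["prior"] for c, p in classes.items()}
print(max(scores, key=scores.get))  # 'class_b' -- its density is higher at x = 1.5
```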