I'm currently implementing a Gaussian Naive Bayes classifier. Of course, if I'm doing classification by
$$ \text{argmax}_{C_i} P(C_i)P(D|C_i), $$
then the probabilities can get very small, so I want to use log probabilities. I'm seeing three possibilities:
$$ \text{argmax}_{C_i} P(C_i)\log P(D|C_i), $$
$$ \text{argmax}_{C_i} \log P(C_i) \log P(D|C_i), $$
$$ \text{argmax}_{C_i} \log P(C_i) + \log P(D|C_i), $$
Which of them is the correct way to go? From a calculation point of view, the second one seems right, because with the others I'm getting negative values; but from a math point of view, the third one is right, due to the following:
$$ P(C_i|D) = \frac{P(C_i)P(D|C_i)}{P(D)} \propto P(C_i)P(D|C_i), $$
$$ \log P(C_i|D) = \log P(C_i) + \log P(D|C_i) - \log P(D). $$
$P(D)$ can be dropped because it does not depend on the class. Anyway, for all variants I'm getting values outside $[0, 1]$, but I think this is OK because I'm calculating probability densities (from Gaussian distributions) rather than probabilities.
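The third (log-sum) variant could be sketched as follows. This is a minimal illustration, not a full implementation: the two classes, their priors, and the per-feature means and variances are made-up numbers, and each feature is assumed conditionally independent given the class, as Naive Bayes requires.

```python
import math

# Hypothetical per-class parameters: prior P(C), and (mean, variance)
# for each Gaussian feature. These numbers are invented for illustration.
params = {
    "A": {"prior": 0.6, "features": [(0.0, 1.0), (5.0, 2.0)]},
    "B": {"prior": 0.4, "features": [(3.0, 1.5), (1.0, 1.0)]},
}

def gaussian_logpdf(x, mean, var):
    # Log of the Gaussian density; note this can be positive for small var,
    # which is why the scores are not confined to [0, 1] (or to (-inf, 0]).
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify(x):
    # argmax_C [ log P(C) + sum_j log p(x_j | C) ]  -- the third variant.
    scores = {
        c: math.log(p["prior"])
           + sum(gaussian_logpdf(xj, m, v)
                 for xj, (m, v) in zip(x, p["features"]))
        for c, p in params.items()
    }
    return max(scores, key=scores.get)

print(classify([0.2, 4.8]))  # near class A's feature means, so prints "A"
```

Because only the argmax matters, the dropped $\log P(D)$ term and any other class-independent constants never need to be computed.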
I have a second question. I'm also interested in getting the importance of each feature for each pair of classes. How could this be calculated based on Gaussian Naive Bayes? I need this because I want to visualize the 10 most important features for each pair of classes.
One thing to check would be the actual arithmetic. Floating-point calculations involving a mix of very big and very small values sometimes yield results that are surprisingly far from the "real" answer.
I'd also check to make sure that none of the features were "blowing up" and assigning no/very low probability (or, in log-space, $-\infty$) to all classes. This can easily swamp the "good" features.
– Matt Krause Oct 07 '16 at 16:17
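To illustrate the two points in the comment above: in probability space, a single feature far in the tail of its class-conditional Gaussian underflows to a density of exactly 0.0, collapsing the whole product; in log-space the same feature yields a large negative (or $-\infty$) term that can swamp every other feature. A small self-contained demonstration with invented values:

```python
import math

def gaussian_pdf(x, mean=0.0, var=1.0):
    # Plain Gaussian density; underflows to 0.0 far in the tail (float64
    # cannot represent values much below ~1e-308).
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gaussian_logpdf(x, mean=0.0, var=1.0):
    # Log-density stays finite where the density itself underflows.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

tail = gaussian_pdf(40.0)       # exp(-800) underflows to 0.0
print(tail)                     # 0.0

# In probability space, one underflowed feature zeroes the whole product,
# so every class gets likelihood 0 and the argmax is meaningless:
print(0.9 * 0.8 * tail)         # 0.0

# In log-space the term is finite (about -800.9) but dominates the sum,
# which is the "swamping" effect described in the comment:
print(gaussian_logpdf(40.0))
```

If such a term is finite but huge for every class, it drowns out the informative features; if it is $-\infty$ (e.g. a zero-variance estimate), every class score becomes $-\infty$. Checking for and smoothing these cases is worth doing before comparing classifiers.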