I have a problem with a simple Naive Bayes calculation. Suppose I have an inbox with the following characteristics:
- The mailbox contains 100 emails.
- 50 emails contain the word “money”.
- 30 emails contain the word “viagra”.
- The user manually flagged 30 emails as spam. Of these, 25 contain the word “viagra” and 25 contain the word “money”, but we don't know exactly which emails contain which words.
So the probability that an email containing both the words "money" and "viagra" is spam should (emphasis on the should) be:
$$ P(spam \mid money \cap viagra) = \frac{P(money \mid spam) P(viagra \mid spam)P(spam)}{P(viagra)P(money)} $$
But when I plug in the numbers I get about 1.39, which is greater than one and so cannot be a probability. Is this due to the independence assumption of Naive Bayes, or did I get something wrong?
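
For reference, here is a minimal Python sketch that reproduces my arithmetic. The variable names are my own. The second part is only an assumption on my side for comparison: it swaps the denominator for the sum of the per-class scores (spam and non-spam), using non-spam counts derived from the numbers above (70 non-spam emails, of which 25 contain "money" and 5 contain "viagra").

```python
from fractions import Fraction

# Counts taken from the inbox description above.
total = 100
n_money = 50
n_viagra = 30
n_spam = 30
n_money_spam = 25
n_viagra_spam = 25

p_spam = Fraction(n_spam, total)
p_money = Fraction(n_money, total)
p_viagra = Fraction(n_viagra, total)
p_money_given_spam = Fraction(n_money_spam, n_spam)
p_viagra_given_spam = Fraction(n_viagra_spam, n_spam)

# The formula exactly as written above: denominator P(viagra) * P(money).
spam_score = p_money_given_spam * p_viagra_given_spam * p_spam
naive_ratio = spam_score / (p_viagra * p_money)
print(float(naive_ratio))  # 25/18, about 1.39

# Assumption for comparison: normalize over both classes instead.
# Non-spam counts are derived, not given directly in the description.
n_ham = total - n_spam                      # 70
n_money_ham = n_money - n_money_spam        # 25
n_viagra_ham = n_viagra - n_viagra_spam     # 5

p_ham = Fraction(n_ham, total)
ham_score = Fraction(n_money_ham, n_ham) * Fraction(n_viagra_ham, n_ham) * p_ham
normalized = spam_score / (spam_score + ham_score)
print(float(normalized))  # 35/38, about 0.92
```

With the product of the marginals in the denominator the result exceeds one, whereas normalizing over the two classes keeps the value between 0 and 1 by construction.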