I've been looking into using Bayes to classify incoming emails as belonging to one of several distinct "owners" (so more complex than a spam filter, which only has two outcomes).
I don't have a stats background, so my nomenclature is likely to be poor. Although it'll make this post long, I'll start with some of my understanding below, in the hope that it puts things in context.
I (think I've) understood the simple applications, such as:
I have two bowls. Both bowls have 40 biscuits. Bowl 1 has 30 vanilla and 10 chocolate. Bowl 2 has 20 vanilla and 20 chocolate. If I pick a vanilla biscuit, what's the probability I chose from bowl 1?
p(B1 | v) = The probability of Bowl 1, given we chose a vanilla biscuit
p(B1) = The probability of Bowl 1 (there are two bowls, so 1/2 = 0.5)
p(v | B1) = The probability we'd get vanilla if we did choose from Bowl 1 (30/40 = 0.75)
p(v) = The probability of choosing vanilla from all bowls ((30 + 20) / (40 + 40) = 0.625)
p(B1 | v) = p(B1) * p(v | B1) / p(v) = (0.5 * 0.75) / 0.625 = 0.6
=> 60% chance that if we chose vanilla it came from Bowl 1
If I run the numbers for p(B2 | v) I get 0.4, so this works nicely.
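To make sure I've got this straight, here's the same calculation as a small Python sketch (the variable names are just mine):

```python
# Bayes' rule for the biscuit example: p(B1 | v) = p(B1) * p(v | B1) / p(v)
p_B1 = 0.5                    # two bowls, each equally likely to be chosen
p_v_given_B1 = 30 / 40        # 30 vanilla out of 40 biscuits in bowl 1
p_v = (30 + 20) / (40 + 40)   # all vanilla biscuits over all biscuits

p_B1_given_v = p_B1 * p_v_given_B1 / p_v
print(p_B1_given_v)  # → 0.6
```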
I note that if the input data had bowl 1 with 20 biscuits (4 vanilla, 16 chocolate) and bowl 2 with 80 biscuits (1 vanilla, 79 chocolate) I get some odd output numbers (that don't add up to 1):
- p(B1 | v) = 2
- p(B2 | v) = 0.125
However, I see from "Naive Bayes classifier gives a probability greater than 1" that this is likely because these are probability densities?
I note that there's a 16:1 ratio, which seems about right, and if I divide both by the sum (2 / 2.125 and 0.125 / 2.125) I get ~0.94 and ~0.06 respectively (which add up to 1, and still have the same ratio). I assume this is all "ok".
If I understand correctly, I can add a third bowl (or more), with the only changes being that p(B1) would now be 1/3, or 1/4, etc. (depending on the number of bowls), and p(v) would also change, as I'd have to sum all vanilla biscuits across all bowls and divide by all biscuits across all bowls.
Assuming I haven't made a complete mess of the above... onto the actual problem...
What I want to do is classify the probability that a word in a candidate email belongs to one of my prior known "bowls". I build a word frequency database for each bowl (from prior emails) and test in the same way. E.g. if bowl 1 contains 30 instances of "the" and 10 instances of other words, and my candidate word is "the", I can check the probability that "the" came from bowl 1 (or bowl 2, etc.).
The bit I'm struggling with is how to combine the results from every word in the incoming email.
With a bit of empirical testing, it seems that multiplying each word's resulting probability into a running product is mathematically correct (e.g. if I throw two sixes with a die, the probability is 1/6 * 1/6 = 1/36). Thus I'd multiply the probabilities of each word for each bowl, and the bowl with the largest final product is the winner. E.g.:
Final probability of bowl 1 = p(B1 | "the") * p(B1 | "cat") * p(B1 | "sat") ...
Note that if "the" appeared in my candidate email twice, I'd use p(B1 | "the")^2 instead, and p(B1 | "the")^3 if it appeared 3 times, and so on.
However, it's likely that a candidate email will contain a word that doesn't exist in a bowl, in which case that word's probability is 0 (meaning the total result for that bowl would be 0).
Mathematically that's correct: if I had bowls with 3 types of biscuit (e.g. chocolate, vanilla, strawberry) and took a handful from one bowl (e.g. vanilla, vanilla, strawberry), then any bowl that didn't contain strawberry biscuits has 0 probability of being the source, regardless of the probabilities for the other bowls.
In this instance however, I'm looking for which bowl was the "best fit" (i.e. on balance, the most likely).
I could just sum the probabilities (p(B1 | "the") + p(B1 | "cat") + p(B1 | "sat")) and choose the bowl with the largest number, but I'm aware I'm now out of my depth, and likely doing the "wrong" thing from a stats point of view.
Any re-education on my understanding of Bayes above, and advice on how to combine the results for each candidate word, would be greatly appreciated. Preferably in layman's terms, because that's what I am ;)