How would you deal with categorical data in a naive Bayesian classifier?

Question

I've built a little naive Bayesian classifier that works with Boolean and real values. Boolean features are modeled with Bernoulli distributions, while real-valued data are handled via kernel mixture estimators. I'm currently in the process of adding count data.

How would one deal with categorical data though, e.g. Monday, Tuesday, Wednesday, or Toyota, Honda, Ford?

My initial thoughts are to assign a number to each category, treat it as a normal real value and round to the nearest integer category on prediction. That seems very wrong to me though.

Not sure about the correct answer, but your initial thought is definitely incorrect (as you suggest), because it implies an ordering among the categories. – eykanal, Dec 17 '12 at 20:51
I'm not sure I follow what you mean by "treat it as a normal real value and round to the nearest integer category on prediction", but it sounds like an incorrect approach. Categorical data is treated similarly to Boolean data: it is a discrete variable and should be modeled as discrete, not real-valued. Just assign integers $1,...,K$ as labels to your $K$ categories, then use the counts to estimate the parameters of your categorical distribution, just as you used counts to estimate the Bernoulli parameters. – jerad, Dec 17 '12 at 21:21
Correct, it is the incorrect approach; I wasn't sure what the correct technique was. Also, oddly enough, the platform I'm using, Mathematica, doesn't seem to have a categorical/discrete distribution available.
jerad, thanks for the pointer; I'll get working on this. – , Dec 17 '12 at 22:36

score 3 · Answer 1 · answered Dec 18 '12 at 01:49

For a Naive Bayes classifier, categorical values are the easiest to deal with. All you are really after is P(Feature | Class), which you can estimate directly from counts. For the days of the week, compute P(Monday | Class=Yes), P(Tuesday | Class=Yes), and so on for each class.
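As a rough illustration of the counting approach described above, here is a minimal Python sketch. The function name `fit_categorical`, the toy data, and the additive (Laplace) smoothing parameter `alpha` are assumptions made for this example and are not part of the original answer; smoothing is included only so that a category never seen with a class does not get probability zero.

```python
from collections import Counter, defaultdict

def fit_categorical(values, labels, alpha=1.0):
    """Estimate P(value | class) from counts.

    alpha is an assumed additive-smoothing constant (alpha=0 gives raw frequencies).
    """
    categories = sorted(set(values))
    counts = defaultdict(Counter)          # counts[class][value] -> occurrences
    for value, cls in zip(values, labels):
        counts[cls][value] += 1

    probs = {}
    for cls, counter in counts.items():
        total = sum(counter.values()) + alpha * len(categories)
        probs[cls] = {v: (counter[v] + alpha) / total for v in categories}
    return probs

# Hypothetical toy data: day of week as the feature, yes/no as the class.
days = ["Mon", "Tue", "Mon", "Wed", "Tue", "Mon"]
play = ["yes", "yes", "no", "no", "yes", "yes"]

likelihood = fit_categorical(days, play)
print(likelihood["yes"]["Mon"])   # P(Mon | Class=yes) with smoothing
print(likelihood["no"]["Tue"])    # nonzero even though Tue never occurs with "no"
```

At prediction time, the looked-up value P(category | class) is multiplied into the class score exactly like the Bernoulli and kernel-density terms for the other features.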

broccoli

1,048

score 0 · Answer 2 · answered Mar 17 '16 at 17:14

The way to deal with categorical data is to turn each category into its own feature with Boolean values (one-hot encoding). This not only removes the limit that some libraries place on the number of categories (not applicable here, since you have written your own function), but is also fast.

answered Mar 17 '16 at 17:14

atul anand

1

Are you sure this works? If we encode each category as a boolean feature, that pretty much implicitly violates the independence assumption of Naive Bayes. Please see Ami Tavory's answer to this question. – AdmiralWen Jul 06 '17 at 03:17
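One way to see the concern raised in this comment is to compare the likelihood term each encoding contributes. The sketch below uses made-up probabilities for a single class (an assumption for illustration only). With a categorical model the term is just P(value | class); with one-hot Boolean features treated as independent Bernoulli variables, the term also picks up a factor (1 - p) for every category that is absent, and the resulting values no longer sum to 1 over the categories.

```python
import math

# Hypothetical P(category | class) for one class; the numbers are invented.
p = {"Toyota": 0.5, "Honda": 0.3, "Ford": 0.2}

def categorical_likelihood(value):
    """Likelihood term under a single categorical feature."""
    return p[value]

def one_hot_bernoulli_likelihood(value):
    """Likelihood term when each category is a separate Bernoulli feature.

    The 'on' feature contributes p, every 'off' feature contributes (1 - p).
    """
    return math.prod(p[k] if k == value else 1.0 - p[k] for k in p)

for car in p:
    print(car, categorical_likelihood(car), one_hot_bernoulli_likelihood(car))

# Total mass assigned to the three valid one-hot patterns is less than 1,
# because the Bernoulli model also gives mass to impossible patterns
# (for example, Toyota and Honda both "on").
print(sum(one_hot_bernoulli_likelihood(car) for car in p))
```

Whether the extra factors change the predicted class depends on the data, since they differ from class to class; the sketch only shows that the two encodings are not the same model.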
