Will decision trees perform splitting of nodes by converting categorical values to numerical in practice?

Question

In Decision Trees, when doing classification or regression, do we use only numerical values? Suppose I have a categorical column Wind as a feature. Suppose I am have 5 rows (observations), and the values for Wind are [high,low,high,medium,medium].

Can I feed this categorical data into a decision tree classifier (like scikit-learn), will it take the data? That is, will the splitting of node based on wind happen on categorical values high with count 2, medium with count 2 and low with count 1. Or, will the system converts it into dummy variable?

score 18 · Accepted Answer · edited Jun 29 '19 at 15:25

The theory is different than the practice. Basically, how the algorithms were described in theory is not always respected in practice. In the case of scikit-learn, the decision trees are implemented considering only numerical features. This is wrong, or at least, not complete, since for nominal variables you have different ways to find splits than for numerical variables.

A short example. Numerical variables are first sorted and walking through the sorted values, new tests are created and evaluated. Tests on numerical inputs have the form: to the left all values greater than equal with a possible split value, to the right the other. For nominal values there are various tests, one of them has the form: on the left goes instances with one category, on the right the rest of the instances.

One can always transform a nominal variable into a numeric one using an encoding of some sort. This is fine as long as you do not do some math with the encoded values. But the problem with the implementation from scikit-learn is that it will use the logic for numerical variables on split, applied on the numerical encoding of a nominal variable. Thus, you really can't reproduce the behavior for nominals that I had given to you as an example.

It received also consideration the one hot encoding alternative. This means you will transform a nominal variable into multiple binary variable (one for each level of the nominal variable). Treating numeric encoding of nominal variables or one hot encoding might produce results (no doubt of that). The point is that they produces different results than the algorithm specified in theory since they do not fully implement it.

There are various ML libraries, most of them having support for treating categorical variables (R names them factors, Weka names them nominals, etc).

[Later edit: address comments from Soren and Candic3]

As far as I know specification from Breiman on random forests considers binary splits on nominal variables from all groups of variables combinations. As Soren said, this are $2^{L-1}$. This is too much. However there are two things which can be said about that.

First one is that you don't have to follow this recipe for nominal. You can search for binary splits on nominal variables of the form one vs all. These are short in number, even for variables with many levels. One can say that you can transform into binary indicators with some encoding. This is true, but not complete, since you have to deal somehow with missing values, which is not an easy job with encoding. Using surrogates or weighted random is impossible. Somehow you must do variable imputation before the encoding takes place, or consider missing as another category.

The second thing is that in case of binary classification for Gini impurity there are proofs that shows that the best split from all $2^{L-1}$ is found much faster if you use the following procedure. Compute purity function for each level subset. Sort variables by the value of impurity function computed on each subset. Then, compute binary splits of subsets where on one side you have first $k$ variables sorted previously, and in the other group the remaining values. This procedure is computed in linear time so, is much efficient and can actually handle variables with more than 32 levels. For details on that see Elements of Statistical Learning, 2ed - page 310, section 9.2.4 Other Issues. You will find there an explanation and following references.

My personal conclusion in topics like that is the following. If possible use categorical treatment. This is sometimes possible even for large number of levels. However for large number of levels you should proceed with caution. Too much levels can be translated in low predictive power. So maybe a grouping of the levels gives more than use directly the variable.

For example if you have zip codes. The number of levels can be huge. A direct treatment is possible, but perhaps some decoding of zip codes into regions has an increased level of generality and thus more predictive power.

My thinking is that too much granularity in nominal levels should be analyzed and solved in a more interpretative way that directly by programming tricks.

Nicely answered. So, R has that ability . Thanks for the information. — Sarath R Nair, May 15 '15 at 10:20
R is able to do that. However, note that in R factor variables are allowed to have a maximum number of levels. The maximum is 32 or 64 or something else, I don't remember now. I hope you don't have this case. — rapaio, May 15 '15 at 10:22
And if more than 32 levels you can use extraTrees algorithm instead. — Soren Havelund Welling, May 15 '15 at 10:43
R factors can ofc have a lot more levels. Implementations of tree models such as randomForest cannot handle more than 32 levels, because every possible split is tried and that increases exponentially, e.g. 2^(32-1)=2.1 10^9. If more than 32 levels one can use the extraTrees algorithm instead which will only try a much smaller random fraction of splits. — Soren Havelund Welling, May 15 '15 at 10:55
So, what is the proper way to handle it in theory? I'm not sure I'm getting that from your answer, @rapaio. Thanks. — makansij, Jan 10 '16 at 06:08

score 3 · Answer 2 · answered May 15 '15 at 07:24

Generally, categorical variables should be treated as such, and not assigned dummy numerical values. Imagine a case where the variable is "hair color", and the possible values are "red, black, blonde, brown". Assigning a numerical value to this variable wouldn't make any sense because there's no inherent ordering. If there is inherent ordering in your variable's values, like with your wind example, you may need to specify that directly (depending on your analysis package) as numerical values, associating each variable state with a number directly - your package probably won't necessarily put them in the order Low, Medium, High by default.

So, I have one question. In the case of hair colors , if the program like scikit learn is obtaining a column of hair colors, will it count the unique number of hair colors ? My primary doubt a computer can do calculations or comparisons based on numerical values. So, if the system find a variable or value in this case as "black" hair color, how come the computer knows , ok its a unique thing. I should split the node by considering it. How that happens in practice ? — Sarath R Nair, May 15 '15 at 07:32

Will decision trees perform splitting of nodes by converting categorical values to numerical in practice?

2 Answers2

Linked

Related