What happens to multi-category variables in algorithms like Random Forest that sample the feature space?

Question

Suppose I have a multi-level categorical variable like color (say, with 7 levels). Some software libraries only allow numeric matrices to train models, so we need to encode the color variable.

In this case we would use [levels - 1] = 6 columns to do this. But what happens with models like random forests or gradient boosting, where the feature space is also sampled? In this case, it feels like I'm losing information, because the probability of a given 'variable level' being selected will decrease.
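To make the concern concrete, here is a minimal sketch of the encoding and of the selection probability when columns are sampled at a split. The level names, the 4 extra numeric features, and the choice of 3 sampled columns per split are assumptions made up for illustration. The helper names are hypothetical and do not come from any library.

```python
from math import comb


def one_hot_drop_first(values, levels):
    """Encode each value as len(levels) - 1 indicator columns.

    The first level is the baseline and is represented by all zeros.
    """
    return [[1 if v == lvl else 0 for lvl in levels[1:]] for v in values]


def prob_any_selected(n_features, n_group, mtry):
    """Probability that at least one of n_group columns is among mtry
    columns drawn without replacement from n_features columns."""
    return 1 - comb(n_features - n_group, mtry) / comb(n_features, mtry)


# Hypothetical level names for the 7-level color variable.
levels = ["red", "orange", "yellow", "green", "blue", "indigo", "violet"]
encoded = one_hot_drop_first(["red", "green", "violet"], levels)

# Assumed setup: 4 other numeric features plus the 6 dummy columns,
# with 3 columns sampled at each split.
n_features = 4 + (len(levels) - 1)
p_one_level = prob_any_selected(n_features, 1, 3)
p_any_color = prob_any_selected(n_features, len(levels) - 1, 3)

print(f"one specific color column sampled: {p_one_level:.3f}")
print(f"at least one color column sampled: {p_any_color:.3f}")
```

Under these assumptions a single dummy column is sampled less often than the color variable as a whole, which is the effect described above.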

I hope someone can help me to understand this better.

score 0 · Accepted Answer · edited Apr 13 '17 at 12:44

This thread is similar and has a fine answer and some comments.

It is often not much of a problem. In scikit-learn's RandomForestRegressor/RandomForestClassifier you can get away with treating factors as numeric levels. If you are uncomfortable with this, you could implement many random enumerations of the categories, one for each of many subforests, and combine them in an ensemble. Think of this as trying some, but not all, splits.

Arborist (Rborist) has an elegant implementation that tries all categorical splits up to a certain upper limit. Beyond that limit, only a random subsample of the possible splits is tried in each node. ExtraTrees by default uses no bootstrapping and tries only a few random splits (for both numerical and categorical features). randomForest cannot avoid trying all splits. With many categories (e.g. 10 categories give $2^9 - 1 = 511$ possible binary splits), there will be a cost in speed and rarely any performance advantage.
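As a rough illustration of why exhaustive categorical splitting gets expensive, the sketch below counts and enumerates the two-group partitions of $k$ unordered categories (there are $2^{k-1} - 1$ of them, so 511 for 10 categories, just under $2^9$) and draws one random enumeration of the kind suggested above. The function names are hypothetical and this is not how any particular library implements it.

```python
import random
from itertools import combinations


def n_binary_splits(k):
    """Number of ways to split k unordered categories into two non-empty groups."""
    return 2 ** (k - 1) - 1


def enumerate_splits(categories):
    """Yield each two-group partition exactly once.

    The first category is always kept in the left group, so a partition
    and its mirror image are not both produced.
    """
    first, rest = categories[0], categories[1:]
    # r is the number of other categories that join `first` on the left;
    # stopping at len(rest) - 1 keeps the right group non-empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(categories) - left
            yield left, right


def random_enumeration(categories, rng):
    """Assign each category a distinct integer code in random order."""
    codes = list(range(len(categories)))
    rng.shuffle(codes)
    return dict(zip(categories, codes))


for k in (3, 7, 10):
    print(k, "categories ->", n_binary_splits(k), "possible binary splits")

# One random enumeration; a subforest trained on these codes only sees the
# k - 1 threshold splits that this particular ordering allows.
mapping = random_enumeration(list("abcdefg"), random.Random(0))
print(mapping)
```

A tree that treats the integer codes as numeric considers at most $k - 1$ threshold splits per enumeration, so averaging over several enumerations samples from the full set of partitions without enumerating it.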

Related: an answer on how to convert categorical features (link).
