I'm a bit confused about using Random Forest in sklearn when we have categorical variables. I've read this article, which states that one-hot encoding hurts performance when using decision-tree-based methods:
https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/
I'm facing this exact problem with a dataset. I have a large number of features with continuous values and three other categorical features that I know are highly correlated with the target variable. However, when I use mutual_info_classif and RandomForest, these categorical variables come out as unimportant. I have tried keeping them as categorical variables, but as far as I know the sklearn implementation can't handle categoricals directly. I've also one-hot encoded them, but then I run into the problem explained in the link.
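
For reference, here is a minimal sketch of roughly what I'm doing (the file name, column names, and hyperparameters are made up, but the workflow is the same):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif

df = pd.read_csv("data.csv")  # many continuous columns + 3 categorical ones
categorical_cols = ["cat_a", "cat_b", "cat_c"]  # hypothetical names

# One-hot encode the categoricals; continuous columns stay as they are
X = pd.get_dummies(df.drop(columns=["target"]), columns=categorical_cols)
y = df["target"]

# Mutual information ranks the one-hot dummies near the bottom
mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(mi.sort_values(ascending=False).head(20))

# Same story with the impurity-based importances of a random forest
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(20))
```

In both rankings the dummy columns for the three categorical features end up far below the continuous features.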
What options are left? Should I just switch to a different Random Forest library? Or should I move to another model that works more effectively with one-hot encoding, such as SVM?