
I am running a random forest model on the following variables (attached):

[attached image: list of model variables]

Is it necessary with a Random Forest classifier to remove highly correlated variables, or should I leave the model as is? If I were going to remove 'redundant' variables, what would be the best approach?

My workflow currently is:

  • Calculate Pearson's correlation coefficients between the variables.
  • Assess the variable importance chart and systematically remove, one at a time, variables that are highly correlated with important ones, re-running the model and observing performance after each removal (a sketch of this loop follows the list).
  • Stop removing variables once the OA, CA, and UA decrease. The other approach I was trying was to remove the least important variables one by one and re-run the model, but Pearson's correlation coefficient might be a more defensible approach.
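For concreteness, here is a minimal sketch of that removal loop. It assumes a pandas DataFrame `X` of predictors and a target vector `y`; the 0.9 correlation threshold, the 5-fold accuracy scoring, and the forest settings are all illustrative choices, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def overall_accuracy(X, y):
    """Cross-validated overall accuracy for a default random forest."""
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    return cross_val_score(rf, X, y, cv=5, scoring="accuracy").mean()


def prune_correlated(X, y, corr_threshold=0.9):
    """Drop, one at a time, the less important member of each highly
    correlated pair, stopping once accuracy decreases."""
    X = X.copy()
    best_acc = overall_accuracy(X, y)
    while True:
        # absolute Pearson correlations, ignoring the diagonal
        corr = X.corr(method="pearson").abs().to_numpy()
        np.fill_diagonal(corr, 0.0)
        if corr.max() < corr_threshold:
            break  # no highly correlated pairs remain

        # importances from a forest fit on the current variables
        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
        importance = pd.Series(rf.feature_importances_, index=X.columns)

        # most correlated pair; candidate to drop is its less important member
        i, j = np.unravel_index(corr.argmax(), corr.shape)
        drop = min([X.columns[i], X.columns[j]], key=lambda c: importance[c])

        acc = overall_accuracy(X.drop(columns=[drop]), y)
        if acc < best_acc:
            break  # stop once accuracy starts to decrease
        X = X.drop(columns=[drop])
        best_acc = acc
    return X.columns.tolist(), best_acc
```

The same loop could be rerun with other accuracy measures (e.g. per-class accuracies) as the stopping criterion.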

Thanks

Galen
  • Did you review this thread? https://stats.stackexchange.com/questions/141619/wont-highly-correlated-variables-in-random-forest-distort-accuracy-and-feature – R Carnell Feb 09 '24 at 19:58
  • It isn't logically or physically necessary, but it sounds like a good idea to explore. But if you are primarily concerned with generalizable predictive accuracy then I would focus on that as your criterion for deciding. – Galen Feb 09 '24 at 20:04

1 Answer


I would claim it's not normally a good idea to remove them; it is better to average correlated variables. Consider that a random forest is itself the average of a bunch of correlated predictors (the trees). The underlying assumption is that the correlation arises because each variable is correlated with the dependent variable, and averaging reduces the independent noise between the variables.
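A minimal sketch of that averaging idea, assuming a pandas DataFrame `X` and a hypothetical group of correlated columns (the column names here are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical group of mutually correlated predictors
correlated_group = ["temp_mean", "temp_max", "temp_min"]

# standardise so each variable contributes on the same scale, then average
scaled = pd.DataFrame(
    StandardScaler().fit_transform(X[correlated_group]),
    columns=correlated_group,
    index=X.index,
)
X_reduced = X.drop(columns=correlated_group)
X_reduced["temp_composite"] = scaled.mean(axis=1)  # single averaged feature
```

The composite column then replaces the original group when fitting the forest.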

seanv507