
I am running a random forest model on the following variables (attached):

[attached image: list of model variables]

Is it necessary with a Random Forest classifier to remove highly correlated variables, or should I leave the model as is? If I were going to remove 'redundant' variables, what would be the best approach?

My workflow currently is:

  • Calculate Pearson's correlation coefficients between the variables.
  • Assess the variable importance chart and systematically remove, one at a time, variables that are highly correlated with important ones, re-running the model and observing performance after each removal (a sketch of this loop follows the list).
  • Stop removing variables once the OA, CA, and UA decrease. The other approach I was trying was to remove the least important variables one by one and re-run the model, but Pearson's correlation coefficient might be a more defensible approach.
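For concreteness, here is a minimal sketch of that removal loop. It assumes a pandas DataFrame `X` of predictors and a target vector `y`; the 0.9 correlation threshold, the 5-fold accuracy scoring, and the forest settings are all illustrative choices, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def overall_accuracy(X, y):
    """Cross-validated overall accuracy for a default random forest."""
    rf = RandomForestClassifier(n_estimators=500, random_state=0)
    return cross_val_score(rf, X, y, cv=5, scoring="accuracy").mean()


def prune_correlated(X, y, corr_threshold=0.9):
    """Drop, one at a time, the less important member of each highly
    correlated pair, stopping once accuracy decreases."""
    X = X.copy()
    best_acc = overall_accuracy(X, y)
    while True:
        # absolute Pearson correlations, ignoring the diagonal
        corr = X.corr(method="pearson").abs().to_numpy()
        np.fill_diagonal(corr, 0.0)
        if corr.max() < corr_threshold:
            break  # no highly correlated pairs remain

        # importances from a forest fit on the current variables
        rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
        importance = pd.Series(rf.feature_importances_, index=X.columns)

        # most correlated pair; candidate to drop is its less important member
        i, j = np.unravel_index(corr.argmax(), corr.shape)
        drop = min([X.columns[i], X.columns[j]], key=lambda c: importance[c])

        acc = overall_accuracy(X.drop(columns=[drop]), y)
        if acc < best_acc:
            break  # stop once accuracy starts to decrease
        X = X.drop(columns=[drop])
        best_acc = acc
    return X.columns.tolist(), best_acc
```

The same loop could be rerun with other accuracy measures (e.g. per-class accuracies) as the stopping criterion.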

Thanks

Galen
  • Did you review this thread? https://stats.stackexchange.com/questions/141619/wont-highly-correlated-variables-in-random-forest-distort-accuracy-and-feature – R Carnell Feb 09 '24 at 19:58
  • It isn't logically or physically necessary, but it sounds like a good idea to explore. But if you are primarily concerned with generalizable predictive accuracy then I would focus on that as your criterion for deciding. – Galen Feb 09 '24 at 20:04

1 Answer


I would claim it's not normally a good idea to remove them; it is better to average correlated variables. Consider that a random forest is itself the average of a bunch of correlated predictors (the trees). The underlying assumption is that the correlation arises because each variable is correlated with the dependent variable, and averaging reduces the independent noise between the variables.
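A minimal sketch of that averaging idea, assuming a pandas DataFrame `X` and a hypothetical group of correlated columns (the column names here are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# hypothetical group of mutually correlated predictors
correlated_group = ["temp_mean", "temp_max", "temp_min"]

# standardise so each variable contributes on the same scale, then average
scaled = pd.DataFrame(
    StandardScaler().fit_transform(X[correlated_group]),
    columns=correlated_group,
    index=X.index,
)
X_reduced = X.drop(columns=correlated_group)
X_reduced["temp_composite"] = scaled.mean(axis=1)  # single averaged feature
```

The composite column then replaces the original group when fitting the forest.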

seanv507