I am running a random forest model on the following variables (attached):
With a Random Forest classifier, is it necessary to remove highly correlated variables, or should I leave the model as is? If I were to remove 'redundant' variables, what is the best approach?
My current workflow is:
- Calculate Pearson's correlation coefficient between the variables.
- Assess the variable importance chart and systematically remove variables that are highly correlated with the important ones, one at a time, rerunning the model and observing performance after each removal.
- Stop removing variables once the OA, CA, and UA decrease.
The other method I tried was removing the least important variables one by one and rerunning the model, but Pearson's correlation coefficient might be a more defensible approach.
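For concreteness, here is a minimal sketch of the correlation-based pruning step described above. Everything in it is an assumption for illustration: the variable names, the data, the importance scores, and the 0.9 threshold are made up, and in practice the importance scores would come from the fitted forest (for example, scikit-learn's `feature_importances_` attribute). The retrain-and-compare-accuracy step is left out.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def prune_correlated(columns, importance, threshold=0.9):
    """Visit variables from most to least important and keep each one only
    if its |r| with every already-kept variable is at or below the threshold."""
    kept = []
    for name in sorted(columns, key=importance.get, reverse=True):
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical data: var_b is exactly 2 * var_a, so the two are perfectly correlated.
columns = {
    "var_a": [0.1, 0.4, 0.5, 0.7, 0.9],
    "var_b": [0.2, 0.8, 1.0, 1.4, 1.8],
    "var_c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
# Hypothetical importance scores (would normally be read off the fitted model).
importance = {"var_a": 0.5, "var_b": 0.3, "var_c": 0.2}

kept = prune_correlated(columns, importance)
print(kept)  # ['var_a', 'var_c']
```

Here `var_b` is dropped because it is redundant with the more important `var_a`, while `var_c` survives because it is only weakly correlated with `var_a`. Each candidate removal would then still need to be checked by rerunning the model and comparing OA, CA, and UA.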
Thanks
