I am aware that Random Forests aren’t usually hurt by collinearity, but I am trying to reduce the number of variables in my RF model.

Some variables are obviously correlated, such as the GPS coordinates (lat, long) and the area variables (city, state, region). My thinking is that any useful information in the area variables could be expressed through lat/long alone, making the area variables redundant.
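To make the redundancy idea concrete, here is a minimal sketch of how one might check whether the coordinates fully determine an area variable. The records and column names are made up for illustration, and `is_determined_by` is a hypothetical helper, not a library function. Passing this check only shows that the information is present in lat/long in principle; it does not guarantee that a forest will recover it as easily as from an explicit city column.

```python
from collections import defaultdict


def is_determined_by(rows, key_cols, target_col):
    """Return True if each distinct combination of key_cols maps to exactly one target value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        seen[key].add(row[target_col])
    return all(len(values) == 1 for values in seen.values())


# Hypothetical records; column names are assumptions, not taken from a real data set.
records = [
    {"lat": 51.51, "long": -0.13, "city": "London", "region": "England"},
    {"lat": 51.51, "long": -0.13, "city": "London", "region": "England"},
    {"lat": 53.48, "long": -2.24, "city": "Manchester", "region": "England"},
    {"lat": 55.95, "long": -3.19, "city": "Edinburgh", "region": "Scotland"},
]

# Coordinates pin down the city in this toy data, but region alone does not.
coords_determine_city = is_determined_by(records, ["lat", "long"], "city")
region_determines_city = is_determined_by(records, ["region"], "city")
print(coords_determine_city)   # True
print(region_determines_city)  # False
```

If the check holds on the real data, dropping city, state and region loses no information in the strict sense, although it may still be worth comparing out-of-bag error with and without them.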

I also have an id variable and a time-stamp of when the record was added, and these two are clearly correlated with each other. Since this is a regression task, I thought the time-stamp may be useful, but in that case should I remove the id variable?
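One way to quantify how closely id tracks the time-stamp is a rank (Spearman) correlation. The sketch below uses only the standard library and invented values; `ranks`, `pearson` and `spearman` are illustrative helpers, and the ids and Unix time-stamps are assumptions. Because tree splits depend only on the ordering of a feature's values, a rank correlation close to 1 suggests the two columns offer nearly the same splits.

```python
def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Plain Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented example: ids assigned in insertion order, time-stamps as Unix seconds.
ids = [101, 102, 103, 104, 105]
timestamps = [1646859480, 1646859600, 1646863200, 1646870000, 1646949600]

rho = spearman(ids, timestamps)
print(rho)  # close to 1 when ids were assigned in time order
```

If the real data give a value near 1, keeping the time-stamp and dropping id is unlikely to cost much, since id would mostly be acting as a proxy for insertion time.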

I am not sure how to proceed. I need to drastically reduce the number of variables in my model, both because my requirements call for as few variables as possible and so that my code doesn’t take ages to run.

Any advice would be invaluable.

  • Possibly a duplicate: https://stats.stackexchange.com/questions/141619/wont-highly-correlated-variables-in-random-forest-distort-accuracy-and-feature, https://stats.stackexchange.com/questions/377033/collinearity-of-features-and-random-forest/377145#377145 – kjetil b halvorsen Mar 09 '22 at 20:58
  • Please edit the question to limit it to a specific problem with enough detail to identify an adequate answer. – Community Mar 09 '22 at 22:13

0 Answers