I have a model where I can code a predictor variable either as a continuous variable with the raw values, or as a categorical variable by assigning the values to quartiles. When I run a random forest model with the continuous variable, it comes out as the most important variable in the model. When I run the model with the same variable as a categorical predictor, it becomes much less important to the model. This has implications for interpretation, so I am curious if there is a mechanical reason (i.e., something in the way that random forests fit data) for why this happens. I know the categorical variable contains less information, but model fit is not significantly improved by the continuous version of the variable. I have reason to be cautious about using the continuous variable because of the quality of the information in the variable.
- "model fit is not significantly improved by the continuous version of the variable" - how does performance change on the training set and test set, separately? – Ben Reiniger Mar 20 '23 at 17:57
- I have been using OOB sampling to estimate model fit – Bobby Davis Mar 20 '23 at 18:33
- Which measure of feature importance are you using? // It might be useful to know how much better the overall training score is with the continuous feature. – Ben Reiniger Mar 20 '23 at 19:07
- I'm using impurity for feature importance. I don't have the specific numbers, but the OOB R² may go from ~22% to something like 24% with the switch from the categorical to the continuous variable. It's a pretty trivial amount. – Bobby Davis Mar 20 '23 at 19:11
- Related – Dave Oct 02 '23 at 18:06
1 Answer
The random forest may be able to fit the training set substantially better using the raw continuous score, leading to high impurity importance for it. Meanwhile, the impact of that additional information on the OOB score could be positive, negative, or near zero, depending on whether that additional information in the training set is signal or noise.
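A small simulation can illustrate the pattern. This is only a sketch (the variable names and data-generating process are made up for illustration): the same weak signal is offered to a `RandomForestRegressor` once as a raw continuous feature and once binned into quartiles, and the impurity-based `feature_importances_` are compared. The continuous version offers far more candidate split points, so the trees can keep reducing training impurity with it, inflating its importance even when the OOB score barely moves.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                 # predictor of interest
noise = rng.normal(size=(n, 3))        # three uninformative predictors
y = x + rng.normal(scale=2.0, size=n)  # weak signal, R^2 roughly 0.2

# quartile-binned version of x: integer codes 0..3
x_binned = np.digitize(x, np.quantile(x, [0.25, 0.5, 0.75]))

X_cont = np.column_stack([x, noise])
X_cat = np.column_stack([x_binned, noise])

rf_cont = RandomForestRegressor(n_estimators=200, oob_score=True,
                                random_state=0).fit(X_cont, y)
rf_cat = RandomForestRegressor(n_estimators=200, oob_score=True,
                               random_state=0).fit(X_cat, y)

print("continuous: importance=%.3f  OOB R2=%.3f"
      % (rf_cont.feature_importances_[0], rf_cont.oob_score_))
print("quartiles:  importance=%.3f  OOB R2=%.3f"
      % (rf_cat.feature_importances_[0], rf_cat.oob_score_))
```

If you want an importance measure that is less sensitive to the number of available split points, permutation importance computed on held-out (or OOB) data is a common alternative to impurity importance.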
Ben Reiniger – 4,514