9

How good are random forests actually at finding interactions between variables? Does explicitly adding interaction terms to the model increase the predictive power of random forests?

gunes
    See also https://datascience.stackexchange.com/a/61279/55122 – Ben Reiniger Jan 10 '21 at 19:14
    And to upgrade some Related questions to Linked ones: https://stats.stackexchange.com/q/157665/232706 https://stats.stackexchange.com/q/146087/232706 https://stats.stackexchange.com/q/396923/232706 – Ben Reiniger Jan 10 '21 at 19:24

2 Answers

11

@gunes has already given the answer: random forests do work well with interactions. It may help to give a practical example, though. For that, we need some toy data:

set.seed(2021)
n = 5000
x1 <- rnorm(2*n)
x2 <- runif(2*n)
interaction <- x1 * x2                    # the true interaction term
pseudointeraction <- sample(interaction)  # permuted interaction, carries no information
x3 <- rbeta(2*n, 2, 10)
epsilon <- rt(2*n, df = 1)                # heavy-tailed noise (generated but not added to y)

y <- 1*x1 + 2*x2 + 3*x3 + 10*interaction

example <- data.frame(x1 = x1, x2 = x2, x3 = x3,
                      x4 = rbeta(2*n, 5, 5), x5 = runif(2*n, -10, 10),
                      interaction = interaction,
                      pseudointeraction = pseudointeraction,
                      y = y)

So y is a linear function of x1, x2, x3 and the interaction x1*x2. The data set also contains a permutation of the interaction term and two more random variables (x4 and x5), none of which carries any real information. We split this into a training set and an equally sized test set like so:

train <- example[1:n,]
test <- example[(n+1):(2*n),]

Since the toy data were generated from a linear model, linear regression should be able to solve this puzzle easily:

> linear1 <- lm(y ~ . - pseudointeraction , data = example)
> summary(linear1)

Call:
lm(formula = y ~ . - pseudointeraction, data = example)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.071e-12 -3.400e-16  5.000e-17  4.700e-16  1.524e-12 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -1.641e-14  9.004e-16 -1.823e+01  < 2e-16 ***
x1           1.000e+00  4.293e-16  2.329e+15  < 2e-16 ***
x2           2.000e+00  7.480e-16  2.674e+15  < 2e-16 ***
x3           3.000e+00  2.076e-15  1.445e+15  < 2e-16 ***
x4           3.798e-15  1.423e-15  2.669e+00  0.00762 ** 
x5           2.698e-18  3.725e-17  7.200e-02  0.94225    
interaction  1.000e+01  7.415e-16  1.349e+16  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.145e-14 on 9993 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:      1
F-statistic: 1.647e+32 on 6 and 9993 DF,  p-value: < 2.2e-16

The linear model is wrong in declaring x4 a significant predictor, but as was to be expected, $R^2$ is essentially 100%.

If we take the interaction term away or exchange it for the permuted interaction term, things get considerably worse, as was to be expected:

> linear2 <- lm(y ~ . - interaction, data = example)
> summary(linear2)

Call:
lm(formula = y ~ . - interaction, data = example)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.4533  -1.2764   0.0621   1.2977  14.8616 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.1791910  0.1214622  -1.475    0.140    
x1                 6.0262464  0.0287402 209.680   <2e-16 ***
x2                 2.0022413  0.1009082  19.842   <2e-16 ***
x3                 3.0676424  0.2800356  10.954   <2e-16 ***
x4                 0.1791824  0.1919885   0.933    0.351    
x5                 0.0005224  0.0050245   0.104    0.917    
pseudointeraction  0.0121493  0.0496305   0.245    0.807    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.894 on 9993 degrees of freedom
Multiple R-squared:  0.816,     Adjusted R-squared:  0.8159
F-statistic:  7385 on 6 and 9993 DF,  p-value: < 2.2e-16

$R^2$ dropped to 82%. So the interaction term is important in this model.

Let's grow one random forest without the interaction term and one with it:

library(randomForest)
rf1 <- randomForest(y ~ . - interaction, data = example, mtry = 3)
rf2 <- randomForest(y ~ . - pseudointeraction, data = example, mtry = 3)

print(rf1)
print(rf2)

This will take some time to compute, and the print output will show that the random forest without the explicit interaction term explained about 99.67% of the variance, while the random forest with the interaction term explained about 99.93% of the variance.
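If you would rather read those numbers directly from the fitted objects than from the print output, regression forests store the out-of-bag MSE and pseudo $R^2$ in the components mse and rsq. A small sketch (the helper name oob_summary is just for illustration):

oob_summary <- function(rf) {
  c(oob_mse = tail(rf$mse, 1),                 # OOB mean squared error of the full forest
    pct_var_explained = 100 * tail(rf$rsq, 1)) # OOB % variance explained
}
oob_summary(rf1)   # forest without the explicit interaction term
oob_summary(rf2)   # forest with the explicit interaction term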

It is hard to compare in-sample $R^2$ with out-of-bag performance, so we predict our test data set with all four of the models we have so far:

linear1.on.test <- predict(linear1, newdata = test)
linear2.on.test <- predict(linear2, newdata = test)
rf1.on.test <- predict(rf1, newdata = test)
rf2.on.test <- predict(rf2, newdata = test)

As a next step, we compute the residuals on the test set and plot them as boxplots:

boxplot(linear1.on.test - test$y, linear2.on.test - test$y, rf1.on.test - test$y, rf2.on.test - test$y,
        ylim=c(-7, 7), xaxt="n", ylab = "residuals")
axis(1, at = 1:4, labels = c("linear with I", "linear w/o I", "rf w/o I", "rf with I"))

[Figure: boxplots of the test-set residuals for the four models — linear with I, linear w/o I, rf w/o I, rf with I]

So the linear model with the interaction term matches how the data were constructed and therefore predicts best. The linear model without the interaction term has no flexibility to adapt and shows the worst predictive performance of all the models.

For the random forests we see quite similar results with and without the interaction term. Stating the interaction term explicitly appears to be helpful, but the model without it is flexible enough to give good predictions anyway.

Real-world interactions will rarely follow the rule of exact multiplication, so the small advantage of explicitly computing the interaction term may be even smaller with real data.
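Just to sketch what such a non-multiplicative interaction could look like, here is a hypothetical variant of the toy data above (not part of the original example), in which the effect of x2 flips sign with the sign of x1 and no product column describes it exactly:

# hypothetical non-product interaction: the effect of x2 flips sign with x1
y2 <- 1*x1 + 2*x2 + 10 * ifelse(x1 > 0, x2, -x2)
example2 <- data.frame(x1 = x1, x2 = x2, y2 = y2)
rf3 <- randomForest(y2 ~ x1 + x2, data = example2[1:n, ], mtry = 2)
print(rf3)   # compare the % variance explained with the forests above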

All of this will be influenced by sample size, number of features, tuning parameters and so on, but the short take-away message is that, while feature building is a good thing, random forests can detect and make use of interactions even if they are not stated explicitly in the call to randomForest.

Feel free to run this code with different set.seed values, compute RMSEs etc.
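For instance, a minimal RMSE comparison on the test set, reusing the predictions computed above, could look like this:

# root mean squared error on the test set for all four models
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
sapply(list(linear.with.I    = linear1.on.test,
            linear.without.I = linear2.on.test,
            rf.without.I     = rf1.on.test,
            rf.with.I        = rf2.on.test),
       rmse, actual = test$y)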

Bernhard
7

Yes, tree-based methods are good at detecting interactions, but not always. For example, using $x_1 \lessgtr 0$ and $x_2 \lessgtr 0$ in subsequent levels of the tree is equivalent to using $x_1 x_2 \lessgtr 0$ at a single level. That said, since you set a maximum-depth hyperparameter in random forests, adding promising interaction terms explicitly decreases the depth needed to capture them and paves the way for more performance by leaving the remaining levels for other features.
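A quick sanity check of that equivalence in R: for nonzero values, the two consecutive sign splits induce exactly the same partition as a single split on the sign of the product.

z1 <- rnorm(1000); z2 <- rnorm(1000)
# the region z1*z2 > 0 is exactly the region where the two sign splits agree
all((z1 * z2 > 0) == ((z1 > 0) == (z2 > 0)))   # TRUE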

gunes