9

How good are random forests actually at finding interactions between variables? Does explicitly adding interaction terms to the model increase the predictive power of random forests?

gunes
    See also https://datascience.stackexchange.com/a/61279/55122 – Ben Reiniger Jan 10 '21 at 19:14
    And to upgrade some Related questions to Linked ones: https://stats.stackexchange.com/q/157665/232706 https://stats.stackexchange.com/q/146087/232706 https://stats.stackexchange.com/q/396923/232706 – Ben Reiniger Jan 10 '21 at 19:24

2 Answers

11

@gunes has already given the answer: random forests do work well with interactions. It may help to give a practical example, though. For that, we need some toy data:

set.seed(2021)
n = 5000
x1 <- rnorm(2*n)
x2 <- runif(2*n)
interaction <- x1 * x2                    # the true interaction term
pseudointeraction <- sample(interaction)  # permuted interaction, carries no information
x3 <- rbeta(2*n, 2, 10)
epsilon <- rt(2*n, df = 1)                # heavy-tailed noise (generated but not added to y)

y <- 1*x1 + 2*x2 + 3*x3 + 10*interaction

example <- data.frame(x1 = x1, x2 = x2, x3 = x3,
                      x4 = rbeta(2*n, 5, 5), x5 = runif(2*n, -10, 10),
                      interaction = interaction,
                      pseudointeraction = pseudointeraction,
                      y = y)

So y is a linear function of x1, x2, x3 and the interaction x1*x2. The data set also contains a permutation of the interaction term and two more random variables (x4 and x5), none of which carries any real information. We split this into a training set and an equally sized test set like so:

train <- example[1:n,]
test <- example[(n+1):(2*n),]

Since the toy data were generated from a linear model, linear regression should be able to solve this puzzle easily:

> linear1 <- lm(y ~ . - pseudointeraction , data = example)
> summary(linear1)

Call:
lm(formula = y ~ . - pseudointeraction, data = example)

Residuals:
       Min         1Q     Median         3Q        Max 
-1.071e-12 -3.400e-16  5.000e-17  4.700e-16  1.524e-12 

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)    
(Intercept) -1.641e-14  9.004e-16 -1.823e+01  < 2e-16 ***
x1           1.000e+00  4.293e-16  2.329e+15  < 2e-16 ***
x2           2.000e+00  7.480e-16  2.674e+15  < 2e-16 ***
x3           3.000e+00  2.076e-15  1.445e+15  < 2e-16 ***
x4           3.798e-15  1.423e-15  2.669e+00  0.00762 ** 
x5           2.698e-18  3.725e-17  7.200e-02  0.94225    
interaction  1.000e+01  7.415e-16  1.349e+16  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.145e-14 on 9993 degrees of freedom
Multiple R-squared:      1,     Adjusted R-squared:      1
F-statistic: 1.647e+32 on 6 and 9993 DF,  p-value: < 2.2e-16

The linear model is wrong in declaring x4 a significant predictor, but as was to be expected, $R^2$ is essentially 100%.

If we take the interaction term away or exchange it for the permuted interaction term, things get considerably worse, as was to be expected:

> linear2 <- lm(y ~ . - interaction, data = example)
> summary(linear2)

Call:
lm(formula = y ~ . - interaction, data = example)

Residuals:
     Min       1Q   Median       3Q      Max 
-17.4533  -1.2764   0.0621   1.2977  14.8616 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -0.1791910  0.1214622  -1.475    0.140    
x1                 6.0262464  0.0287402 209.680   <2e-16 ***
x2                 2.0022413  0.1009082  19.842   <2e-16 ***
x3                 3.0676424  0.2800356  10.954   <2e-16 ***
x4                 0.1791824  0.1919885   0.933    0.351    
x5                 0.0005224  0.0050245   0.104    0.917    
pseudointeraction  0.0121493  0.0496305   0.245    0.807    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.894 on 9993 degrees of freedom
Multiple R-squared:  0.816,     Adjusted R-squared:  0.8159
F-statistic:  7385 on 6 and 9993 DF,  p-value: < 2.2e-16

$R^2$ dropped to 82%. So the interaction term is important in this model.

Let's grow one random forest without the interaction term and one with it:

library(randomForest)
rf1 <- randomForest(y ~ . - interaction, data = example, mtry = 3)
rf2 <- randomForest(y ~ . - pseudointeraction, data = example, mtry = 3)

print(rf1)
print(rf2)

This will take some time to compute, and the print output will show that the random forest without the explicit interaction term explained about 99.67% of the variance, while the random forest with the interaction term explained about 99.93% of the variance.
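If you would rather read those numbers directly from the fitted objects than from the print output, regression forests store the out-of-bag MSE and pseudo $R^2$ in the components mse and rsq. A small sketch (the helper name oob_summary is just for illustration):

oob_summary <- function(rf) {
  c(oob_mse = tail(rf$mse, 1),                 # OOB mean squared error of the full forest
    pct_var_explained = 100 * tail(rf$rsq, 1)) # OOB % variance explained
}
oob_summary(rf1)   # forest without the explicit interaction term
oob_summary(rf2)   # forest with the explicit interaction term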

It is hard to compare in-sample $R^2$ with out-of-bag performance, so we predict our test data set with all four of the models we have so far:

linear1.on.test <- predict(linear1, newdata = test)
linear2.on.test <- predict(linear2, newdata = test)
rf1.on.test <- predict(rf1, newdata = test)
rf2.on.test <- predict(rf2, newdata = test)

As a next step, we compute the residuals on the test set and plot them as boxplots:

boxplot(linear1.on.test - test$y, linear2.on.test - test$y, rf1.on.test - test$y, rf2.on.test - test$y,
        ylim=c(-7, 7), xaxt="n", ylab = "residuals")
axis(1, at = 1:4, labels = c("linear with I", "linear w/o I", "rf w/o I", "rf with I"))

[Figure: boxplots of the test-set residuals for the four models — linear with I, linear w/o I, rf w/o I, rf with I]

So the linear model with the interaction term matches how the data were constructed and therefore predicts best. The linear model without the interaction term has no flexibility to adapt and shows the worst predictive performance of all the models.

For the random forests we see quite similar results with and without the interaction term. Stating the interaction term explicitly appears to be helpful, but the model without it is flexible enough to give good predictions anyway.

Real-world interactions will rarely follow the rule of exact multiplication, so the small advantage of explicitly computing the interaction term may be even smaller with real data.
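Just to sketch what such a non-multiplicative interaction could look like, here is a hypothetical variant of the toy data above (not part of the original example), in which the effect of x2 flips sign with the sign of x1 and no product column describes it exactly:

# hypothetical non-product interaction: the effect of x2 flips sign with x1
y2 <- 1*x1 + 2*x2 + 10 * ifelse(x1 > 0, x2, -x2)
example2 <- data.frame(x1 = x1, x2 = x2, y2 = y2)
rf3 <- randomForest(y2 ~ x1 + x2, data = example2[1:n, ], mtry = 2)
print(rf3)   # compare the % variance explained with the forests above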

All of this will be influenced by sample size, number of features, tuning parameters and so on, but the short take-away message is that, while feature building is a good thing, random forests can detect and make use of interactions even if they are not stated explicitly in the call to randomForest.

Feel free to run this code with different set.seed values, compute RMSEs etc.
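For instance, a minimal RMSE comparison on the test set, reusing the predictions computed above, could look like this:

# root mean squared error on the test set for all four models
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
sapply(list(linear.with.I    = linear1.on.test,
            linear.without.I = linear2.on.test,
            rf.without.I     = rf1.on.test,
            rf.with.I        = rf2.on.test),
       rmse, actual = test$y)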

Bernhard
7

Yes, tree-based methods are good at detecting interactions, but not always. For example, using $x_1 \lessgtr 0$ and $x_2 \lessgtr 0$ in subsequent levels of the tree is equivalent to using $x_1 x_2 \lessgtr 0$ at a single level. That said, since you set a maximum-depth hyperparameter in random forests, adding promising interaction terms explicitly decreases the depth needed to capture them and paves the way for more performance by leaving the remaining levels for other features.
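A quick sanity check of that equivalence in R: for nonzero values, the two consecutive sign splits induce exactly the same partition as a single split on the sign of the product.

z1 <- rnorm(1000); z2 <- rnorm(1000)
# the region z1*z2 > 0 is exactly the region where the two sign splits agree
all((z1 * z2 > 0) == ((z1 > 0) == (z2 > 0)))   # TRUE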

gunes