
In an answer to this post a user suggests, based on Chapter 3 of the book "The Elements of Statistical Learning" by Hastie et al., the following methods of choosing which interaction effects to include in a model:

  1. Try all possible subsets of the variables and pick the one that yields the regression with the smallest Bayesian information criterion (BIC) value
  2. Use forward or backward stepwise selection

In the comments associated with that answer, both of these approaches are described as being bad.

So, if we shouldn't use method 1) or 2) above, how exactly do we decide which variables/interactions to use in the model? I have seen 'domain knowledge' suggested in a few places, but this seems like a bit of a cop-out. Domain knowledge is not going to help in the very common situation where we have no pre-existing knowledge of whether a particular interaction effect is present in nature and we are relying on the information in the data itself.

For the sake of example, suppose we have the predictors age, gender, height, weight, experience, and IQ, and the response variable salary. How do we decide which interaction effects to include or exclude?

This example is probably the simplest possible scenario, as we understand all of these variables very well, and even then it is not clear how to decide which interactions to include or exclude. In other situations, we will be dealing with predictor variables for which we have no pre-existing intuition on whether interactions between them could affect the response variable.

So I am looking for a systematic method of choosing which interactions to include in a multiple regression model. How does an experienced statistician choose which interactions to include in the case when domain knowledge is not available or of no use?

  • Far from being a cop-out, using domain knowledge is an excellent way to avoid overfitting and to ensure the analysis is relevant. Certainly there are many cases where no such knowledge is available--but that only means you don't have recourse to this excellent technique, and therefore are more likely to endure the consequences of overfitting from exploring too many explanatory variables. – whuber Oct 21 '20 at 14:45
  • The very first thing your experienced statistician does is explicitly recognize that this analysis is exploratory, and that $p$ values should not even be calculated. The next thing they do is try to convince the non-statisticians on the team of this. After that: anything goes. Use BIC, cross validation, prediction into a holdout sample, ... – Stephan Kolassa Oct 21 '20 at 14:45
  • @whuber Even in the example I gave above, there is no comprehensive theory/proof on whether the interaction between, say, height and IQ has an effect on salary. And this is a super simple example, in most fields we have no way of knowing if an interaction between two variables is relevant or not. So should we just put all interactions into the model? Or leave them all out? Or include a specific few (how do we decide which)? I have not been able to find one good answer on how to decide which interactions to include, all I have found are answers telling us what we shouldn't do. – ManUtdBloke Oct 21 '20 at 14:50
  • @StephanKolassa And that goes back to my point. Do we use BIC/CV/etc.. over a set of models that feature every possible combination of interactions? If not, how do we choose the interactions to include? – ManUtdBloke Oct 21 '20 at 14:52
  • "Even in the example I gave above, there is no comprehensive theory/proof on whether the interaction between, say, height and IQ has an effect on salary." This is precisely where you collect data and fit two models, one with and one without the interaction, and then you test the null hypothesis that the interaction has a coefficient of zero. This is classical NHST, no problem. The problem with overfitting/multiple comparisons starts when you do this with all possible interactions. – Stephan Kolassa Oct 21 '20 at 14:56
  • @StephanKolassa Yes that is fine for a single interaction. But my question is when we have several predictors in a multiple regression model, how do we decide which interactions to include? Everything I've read on this topic just talks about what we shouldn't do, I've yet to see an answer that tells us what we should do (apart from use domain knowledge which is very often not possible/useful). So I ask again, how do we decide which interactions to include? – ManUtdBloke Oct 21 '20 at 15:01
  • What is the purpose of the model? Unless it is prediction only and I had enough data for cross-validation and a hold-out set: if I had multiple variables, potentially all possible interactions, and no domain knowledge to call on, I would say I had no business working with the data and would just walk away. – Robert Long Oct 21 '20 at 15:27
  • @RobertLong By "only prediction", do you mean something like training Siri or Alexa to recognize speech? – Dave Oct 21 '20 at 15:32
  • @Dave I mean any kind of prediction task where the only goal is predictive accuracy; as opposed to a task where the goal is inference. – Robert Long Oct 21 '20 at 15:54
  • One useful heuristic is to focus on creating interactions among those variables with the biggest relative importance in the model. – user78229 Oct 22 '20 at 14:04

1 Answer


I think a lot depends on what the purpose of the model is. Inference or prediction?

If it is inference, then you really need to incorporate some domain knowledge into the process; otherwise you risk identifying completely spurious associations, where an interaction may appear to be meaningful but in reality is either an artifact of the sample or is masking some other issue, such as non-linearity in one or more of the variables.

However, if the purpose is prediction, then there are various approaches you can take. One approach is to fit all of the candidate models and use a train / validate / test approach to find the model that gives the best predictions.
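As a minimal sketch of that train / validate idea (the data, variable names, and candidate set here are invented for illustration, not taken from the question):

```r
library(stats)

set.seed(1)
n <- 300
d <- data.frame(x1 = runif(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- d$x1 + d$x2 + d$x1 * d$x2 + rnorm(n)  # true process has one interaction

# split into training and validation sets
idx   <- sample(n, 200)
train <- d[idx, ]
valid <- d[-idx, ]

# candidate formulas: main effects plus various sets of two-way interactions
candidates <- c(y ~ x1 + x2 + x3,
                y ~ x1 * x2 + x3,
                y ~ x1 * x3 + x2,
                y ~ x2 * x3 + x1,
                y ~ (x1 + x2 + x3)^2)  # all two-way interactions

# validation RMSE for each candidate, computed on held-out data
rmse <- sapply(candidates, function(f) {
  fit <- lm(f, data = train)
  sqrt(mean((valid$y - predict(fit, newdata = valid))^2))
})

best <- candidates[[which.min(rmse)]]
print(best)
```

With many predictors the candidate set explodes combinatorially, so in practice you would restrict it (e.g. to two-way interactions only) or use a penalized method rather than enumerating every model.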


Edit: A simple simulation can show what can go wrong with inference in the absence of domain knowledge:

library(magrittr)  # provides the %>% pipe used below

set.seed(50)
N <- 50

X1 <- runif(N, 1, 15)
X2 <- rnorm(N)

Y <- X1 + X2^2 + rnorm(N)

So, here we posit an actual data-generating process of $Y = X_1 + X_2^2 + \varepsilon$, where $\varepsilon$ is standard normal noise.

If we had some domain / expert knowledge that suggested some nonlinearities could be involved, we might fit the model:

> lm(Y ~ X1 + I(X1^2) + X2 + I(X2^2) ) %>% summary()

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.89041    0.65047  -1.369    0.178    
X1           1.21915    0.19631   6.210 1.52e-07 ***
I(X1^2)     -0.01462    0.01304  -1.122    0.268    
X2          -0.19150    0.15530  -1.233    0.224    
I(X2^2)      1.07849    0.08945  12.058 1.08e-15 ***

which provides inferences consistent with the "true" data-generating process.

On the other hand, if we had no such knowledge and instead fit a model with just the first-order terms and their interaction, we would obtain:

> lm(Y ~ X1*X2) %>% summary()

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01396    0.58267  -0.024    0.981    
X1           1.09098    0.07064  15.443  < 2e-16 ***
X2          -3.39998    0.54363  -6.254 1.20e-07 ***
X1:X2        0.35850    0.06726   5.330 2.88e-06 ***

which is clearly spurious.


Further edit: We can also compare the two models' fit via root mean squared error, which should be computed from the residuals (a genuinely predictive comparison would evaluate on held-out data rather than in-sample):

> lm(Y ~ X1*X2) %>% residuals() %>% `^`(2) %>% mean() %>% sqrt()
> lm(Y ~ X1 + I(X1^2) + X2 + I(X2^2)) %>% residuals() %>% `^`(2) %>% mean() %>% sqrt()

Whether a model supports valid inference and whether it predicts well are separate questions,

which underlines my central point that a lot depends on the purpose of the model.

Robert Long
  • Does this answer your question? If so, please consider marking it as the accepted answer, or if not please let us know why so that it can be improved. – Robert Long Nov 10 '20 at 18:38
  • My goal is prediction but fitting all possible models and choosing the best seems impossible as there are an infinite number of possibilities. I have read in many places that methods for overcoming that such as stepwise/subset regression approaches are bad practise. So if (1) we are in the extremely common situation where we don't know what interaction effects exist in nature, and (2) the methods suggested in the TEOSL book are considered bad practise, how can we systematically choose what interactions to include? – ManUtdBloke Dec 17 '20 at 18:40