The short, but perhaps unsatisfying answer is: when you have a prior reason to think that the effect of one variable might depend on what's going on with another variable.
For example, let's say I'm trying to model student scores on a math test as a function of math test scores in the previous year and a binary variable indicating whether the student attended a (randomly assigned) refresher course in rudimentary math topics.
Given that the course covered only rudimentary material, there are good theoretical reasons to think it might produce a bigger impact on students who started at a lower baseline, and little or no impact on students who were already doing well (and thus already knew the rudimentary topics it covered). So I should include an interaction term between prior test scores and course attendance to test whether this is the case (here I would predict a negative and significant interaction coefficient).
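As a sketch of what that test looks like in practice, here is a minimal example using simulated data and the statsmodels formula API. The data, the variable names, and the effect sizes are all made up for illustration; the only point is that the model includes the product term and that we examine its coefficient:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical simulated data where the course helps low-baseline students most
rng = np.random.default_rng(0)
n = 500
prior = rng.normal(50, 10, n)       # prior-year math score
course = rng.integers(0, 2, n)      # randomly assigned refresher course (0/1)
# The course's true effect shrinks as the prior score rises (negative interaction)
score = 10 + 0.8 * prior + course * (20 - 0.3 * prior) + rng.normal(0, 5, n)

df = pd.DataFrame({"score": score, "prior": prior, "course": course})
# 'prior * course' expands to prior + course + prior:course (the interaction)
model = smf.ols("score ~ prior * course", data=df).fit()
print(model.params["prior:course"], model.pvalues["prior:course"])
```

With data generated this way, the `prior:course` coefficient comes out negative and significant, matching the theoretical prediction above.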
Note that this decision was made purely based on my prior theoretical understanding of how the variables should (or could) work. I didn't run a model first without the interaction term and then check some diagnostic or run some post hoc tests.
In general, when you are trying to decide how to specify a model - including whether to include interaction terms - you want to base these decisions on prior theory and literature. It can be tempting to search for some sort of algorithmic approach ("if this number here is less than .05, then include an interaction"), as you seem to be doing, but such approaches tend to cause big problems in practice - like unintentional p-hacking. See prior discussions here about the problems with other attempts to specify models using "algorithmic" approaches.
In the case of interaction terms, there is always a large number of interactions you COULD specify in any model. But if you try to check them all, you will run into a multiple comparisons problem: because you ran so many statistical tests, some of them will come up significant at the .05 level just due to random chance. And some of these interaction terms - even if significant - will make no substantive sense. Finally, including interaction terms eats up degrees of freedom, makes the model harder to interpret, and reduces statistical power. So you only want to include an interaction term if you think the benefit (in terms of interpretation and model fit) outweighs these costs.
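The multiple comparisons point is easy to demonstrate with a toy simulation (my own illustration, not from any real dataset). Here each candidate "interaction" predictor is pure noise with zero true effect, yet screening many of them at alpha = .05 finds at least one "significant" result most of the time, close to the analytic rate of 1 - 0.95^k:

```python
import numpy as np
from scipy import stats

# With k independent null tests at alpha = .05, the chance of at least one
# false positive is 1 - 0.95**k, not 0.05. For k = 20 that is about 0.64.
rng = np.random.default_rng(1)
n, k, trials = 200, 20, 500

false_hits = 0
for _ in range(trials):
    y = rng.normal(size=n)  # outcome: pure noise
    pvals = []
    for _ in range(k):
        x = rng.normal(size=n)  # a candidate predictor with zero true effect
        _, p = stats.pearsonr(x, y)
        pvals.append(p)
    if min(pvals) < 0.05:  # "found a significant interaction!"
        false_hits += 1

rate = false_hits / trials
print(rate)  # far above the nominal 0.05
```

Fitting twenty actual interaction terms would behave the same way; the correlation test just keeps the sketch short.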
In short: take a step back from diagnostics and think about what the variables you are considering for your model are actually doing, and why and how they might relate to the dependent variable. If you can think of a good substantive reason why the effect of one variable might depend on the level of another variable, then consider testing for an interaction between them.