When It Matters
I agree with Frank that statistical significance really shouldn't be the major criterion for interpreting your data. If we simply step away from the statistics for a second, why would this interaction make sense? What is driving this interaction? Sometimes interactions are found by chance. Other times they arise from complex phenomena. You need to consider this before you even move forward with interpretation.
Let's say we have a scenario where we record the measured temperature on a given day and the caloric intake of some frogs for that same day (see the frog diet example below).

Sadly, frogs don't always get to live to have another meal. We decide to record the number of frog deaths that day given the number of flies they eat and the temperature outside. There is likely a much better way to construct this experiment (e.g. survival rates over time), but we will keep the example simple for now. Let us say that, during the course of measuring the frogs, we notice that temperature doesn't really have any practical effect on its own (frogs generally don't die from temperature alone), nor does food intake (I'm not a frog expert, but let's just assume they're efficient with calories).
However, after getting our measurements, our data ends up showing that the combination of high temperature and low caloric intake results in a substantial number of frog deaths. We could perhaps determine, with the trepidation warranted by an observational study, that the interplay between temperature and caloric intake is driving the frog deaths, even though neither alone makes a difference. We surmise that exposure to extreme heat together with a lack of proper nutrition causes higher death rates, and this is then reported to the scientific community. This is the key to Frank's point about subject-matter knowledge: the context of the question matters, and whether or not our regression fits the mold can only be rightfully determined by that knowledge.
Notice that I have not highlighted statistical significance at all, a point I will show to be potentially misleading in the next part of this answer.
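To make the scenario concrete before moving on, here is a minimal, purely hypothetical sketch of what such frog data could look like in R. Every number below (temperatures, calorie counts, thresholds, death probabilities) is made up solely for illustration:

#### Hypothetical Frog Scenario (illustration only) ####
set.seed(42)
n <- 500
temp <- rnorm(n, mean = 85, sd = 10) # hypothetical daily temperature (F)
calories <- rnorm(n, mean = 50, sd = 15) # hypothetical daily caloric intake
# Neither predictor matters on its own here; only the combination of extreme
# heat and poor nutrition raises the chance a frog dies that day:
p_death <- ifelse(temp > 95 & calories < 40, .40, .05)
died <- rbinom(n, size = 1, prob = p_death)
summary(glm(died ~ temp * calories, family = binomial))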
When It Doesn't Matter
We can easily simulate a similar scenario where the main effects are statistically non-significant and the interaction is significant. By creating very tiny effects, we can also show that this distinction may not really matter. Below I simulate some data in R to mimic this:
#### Simulate Data ####
set.seed(123)
x <- rnorm(1e4) # normally distributed predictor
z <- rnorm(1e4) # normally distributed control
#### Create Beta Weights and Error ####
b0 <- 0 # intercept
b1 <- .0001 # slope 1
b2 <- .0001 # slope 2
b3 <- .05 # interaction
e <- rnorm(1e4) # normal error term
y <- b0 + (b1*x) + (b2*z) + (b3 * x * z) + e # linear construction of y
df <- data.frame(x,z,y)
Note a few things before we fit the regression:
- The sample size is large ($n = 10,000$), so we should have excellent power to detect even minor effects (a rough check of the implied precision appears after this list).
- The $\beta$ weights for the intercept and slopes are quite tiny ($\beta = .0001$), and only the interaction has been given slightly more magnitude so that it flags our significance tests ($\beta_3 = .05$).
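As a rough check on that first point, with $\sigma = 1$, a roughly unit-variance predictor, and predictors that are essentially uncorrelated, the standard error of a slope is approximately $\sigma/(\sqrt{n}\,\mathrm{sd}(x)) \approx .01$, which lines up with the Std. Error column in the output below:

#### Rough Precision Check ####
1 / (sqrt(1e4) * sd(x)) # approximate slope standard error, about .01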
We fit the regression in R below:
#### Fit Regression ####
fit <- lm(y ~ x*z)
summary(fit)
With the following output:
Call:
lm(formula = y ~ x * z)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9996 -0.6647  0.0066  0.6733  3.8627 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.006126   0.010077   0.608    0.543    
x           -0.012349   0.010091  -1.224    0.221    
z            0.004928   0.010062   0.490    0.624    
x:z          0.053967   0.010239   5.271 1.39e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.008 on 9996 degrees of freedom
Multiple R-squared:  0.002928,    Adjusted R-squared:  0.002629 
F-statistic: 9.784 on 3 and 9996 DF,  p-value: 1.926e-06
Some things should already stick out:
- The interaction is the only "significant" term, but we must consider the scale of all of the betas. If we return to our frog example, with caloric intake and degrees Fahrenheit as predictors, are these effects practically meaningful?
- The adjusted $R^2 = .002629$, which means the model has very little explanatory value despite being statistically significant (see the quick check below).
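As a quick sketch of just how little even the "significant" interaction buys us, we can compare the $R^2$ with and without it (this foreshadows the model comparison section at the end):

#### How Much R-Squared Does the Interaction Add? ####
summary(lm(y ~ x * z))$r.squared -
  summary(lm(y ~ x + z))$r.squared # increase in R-squared from the interaction alone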
Plotting the data can of course elucidate this point. If we simply look at the scatterplots of $x$ and $z$ against $y$, we see two giant clouds of data which seem to have essentially no association with the outcome:
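One way (among many) to produce those scatterplots, assuming the simulated objects are still in the workspace:

#### Plot Raw Scatterplots ####
par(mfrow = c(1, 2)) # two panels side by side
plot(x, y, main = "x vs. y")
plot(z, y, main = "z vs. y")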

What if we check the interaction between $x$ and $z$? We can do this by first creating some simple slopes:
#### Create Sequence of Values Controlling for +/- SD and Mean of Z ####
newdata1 <- data.frame(
  x = seq(
    min(x),
    max(x),
    length.out = 200 # sequence of data to get prediction line for x
  ),
  z = mean(z) - (1 * sd(z)) # 1 SD below mean of z
)
newdata2 <- data.frame(
  x = seq(
    min(x),
    max(x),
    length.out = 200
  ),
  z = mean(z) # mean of z
)
newdata3 <- data.frame(
  x = seq(
    min(x),
    max(x),
    length.out = 200
  ),
  z = mean(z) + (1 * sd(z)) # one SD above mean of z
)
#### Get Prediction Data ####
pred1 <- predict(fit,newdata = newdata1)
pred2 <- predict(fit,newdata = newdata2)
pred3 <- predict(fit,newdata = newdata3) # predicts model with new data
Then we plot the scatterplot of $x$ and $y$ again, only this time we overlay the predictions when $z$ is at its mean or one standard deviation above/below that mean (indicated by the red lines):
#### Plot Interaction ####
par(mfrow=c(1,1))
plot(x, y, main = "Interaction Plot")
lines(newdata1$x,
      pred1,
      col = "red",
      lwd = 5)
lines(newdata2$x,
      pred2,
      col = "red",
      lwd = 5)
lines(newdata3$x,
      pred3,
      col = "red",
      lwd = 5)
Shown below:

Now we can determine some additional facts:
- While our interaction is statistically significant, the plotted lines show the interaction isn't incredibly strong (they almost overlap; the implied simple slopes are quantified in the sketch after this list).
- Looking at the axes of the plot, scale really matters. Note that the values only range between about $[-4,4]$. Values on a small scale can be considered large in some contexts (for example, if $x$ is blood alcohol concentration) or essentially meaningless in others (if $x$ is income in thousands of dollars). The context and research question will determine whether these effects and interactions are truly meaningful.
- We still don't know from our simulated example what any of this means. I have purposely left out what this simulation was supposed to emulate to illustrate an important point: what we measure should matter for answering questions of scientific interest. If we don't have any background knowledge driving that inference, then it is hard to determine whether the result is indeed important. Something like statistical significance has an otherwise limited role in answering those questions.
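To put a number on that first point, the model-implied simple slope of $x$ is $\beta_1 + \beta_3 z$, so we can compute it at low, average, and high values of $z$ directly from the fitted coefficients (a quick sketch using the objects above):

#### Quantify the Simple Slopes ####
b <- coef(fit)
z_vals <- mean(z) + c(-1, 0, 1) * sd(z) # 1 SD below the mean, the mean, 1 SD above
b["x"] + b["x:z"] * z_vals # implied slope of x at each value of z; all quite small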
Model Comparison
Most of this discussion has not really addressed what to do in terms of model comparison. Consider, for our simulated example, fitting the models with and without an interaction:
#### Fit Models ####
fit1 <- lm(y ~ x + z)
fit2 <- lm(y ~ x * z)
summary(fit1)
summary(fit2)
As shown here:
> summary(fit1)

Call:
lm(formula = y ~ x + z)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5771 -0.6799 -0.0077  0.6883  4.3606 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -0.006785   0.010013  -0.678    0.498  
x            0.020496   0.010027   2.044    0.041 *
z           -0.003838   0.009998  -0.384    0.701  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.001 on 9997 degrees of freedom
Multiple R-squared:  0.0004315,   Adjusted R-squared:  0.0002316 
F-statistic: 2.158 on 2 and 9997 DF,  p-value: 0.1156

> summary(fit2)

Call:
lm(formula = y ~ x * z)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4854 -0.6830 -0.0075  0.6905  4.3599 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.007047   0.010005  -0.704   0.4812    
x            0.020198   0.010018   2.016   0.0438 *  
z           -0.003170   0.009990  -0.317   0.7510    
x:z          0.044362   0.010166   4.364 1.29e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1 on 9996 degrees of freedom
Multiple R-squared:  0.002332,    Adjusted R-squared:  0.002033 
F-statistic: 7.789 on 3 and 9996 DF,  p-value: 3.43e-05
Comparing these two models by their significant terms doesn't really say much. What we really need is some other criterion, such as AIC/BIC or a chi-square (likelihood ratio) test. We can quickly check these to see if the interaction term is indeed important to include:
#### Check Model Comparison ####
anova(fit1,fit2, test = "Chisq")
AIC(fit1)
AIC(fit2)
The chi-square test is unsurprisingly statistically significant, given the sheer number of data points (and thus degrees of freedom):
Analysis of Variance Table

Model 1: y ~ x + z
Model 2: y ~ x * z
  Res.Df   RSS Df Sum of Sq  Pr(>Chi)    
1   9997 10023                           
2   9996 10004  1    19.057 1.278e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
However, our AIC values differ very little here:
> AIC(fit1)
[1] 28409.57
> AIC(fit2)
[1] 28392.54
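BIC, which penalizes each extra parameter more heavily at this sample size ($\log(10{,}000) \approx 9.2$ per parameter versus AIC's 2), can serve as a stricter check; a quick sketch using the fitted objects above (output not shown):

#### Check BIC as Well ####
BIC(fit1)
BIC(fit2)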
So again, we need to consider the data, the theory, and other important factors when comparing such models even in a strictly statistical way.