
So I was volunteering on a research project, and the main researcher would often talk about how an independent variable could be statistically insignificant in a simple linear regression but significant when included with other variables in a multiple linear regression, or vice versa. I was really confused by this concept, but unfortunately he did not have time to explain it to me. I was wondering if anyone on here could explain this in a clear and simple way? Thanks!


1 Answer

Here are two examples illustrating two situations; they are not exhaustive, but hopefully they are still instructive.

Example #1

We simulate a dataset with two continuous predictors, x1 and x2, and a continuous response y, so that on its own x2 is "not significant" in a simple linear regression of y on x2 but it is "significant" in a multiple linear regression of y on both x1 and x2.

Both predictors are linearly associated with y because x1 and x2 separately "cause" y without interacting with each other. However, most of the variation in y is explained by x1, and only a little is explained by x2. If we don't include x1 in the regression, the residual variation is very high and the x2 coefficient doesn't reach significance.

Lesson learned: Even if we are primarily interested in studying the relationship between a particular predictor x2 and the outcome y, it helps to adjust the analysis for known covariates, especially if those covariates explain a lot of the observed variability of y.
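To see the mechanism in the numbers, compare the estimated residual standard deviation and the x2 coefficient across the two fits. This is a quick check on the sim data simulated in the full code below, not part of the original output; sigma() and broom::tidy() are standard helpers for lm objects.

library("broom")

# residual sd is large without x1 and small with it -- the square roots
# of the residual mean squares in the two ANOVA tables below
sigma(lm(y ~ x2, data = sim))
sigma(lm(y ~ x1 + x2, data = sim))

# because x1 and x2 are simulated independently, the x2 estimate itself
# barely changes; only its standard error (and hence its p-value) improves
tidy(lm(y ~ x2, data = sim))
tidy(lm(y ~ x1 + x2, data = sim))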

[Figure: side-by-side plots, "Predicted values of y given x2" and "Predicted values of y given x1 and x2"]

anova(lm(y ~ x2, data = sim))
#> Analysis of Variance Table
#> 
#> Response: y
#>           Df  Sum Sq Mean Sq F value Pr(>F)
#> x2         1   22.59  22.594  0.8305 0.3644
#> Residuals 98 2666.09  27.205

anova(lm(y ~ x1 + x2, data = sim))
#> Analysis of Variance Table
#> 
#> Response: y
#>           Df  Sum Sq Mean Sq  F value    Pr(>F)    
#> x1         1 2562.36 2562.36 2766.744 < 2.2e-16 ***
#> x2         1   36.48   36.48   39.394 9.723e-09 ***
#> Residuals 97   89.83    0.93

The same phenomenon explained with lots more words: How can adding a 2nd IV make the 1st IV significant?

Example #2

We simulate a dataset with two continuous predictors, x1 and x2, and a continuous response y, so that on its own x2 is "significant" in a simple linear regression of y on x2 but it is "not significant" in a multiple linear regression of y on both x1 and x2.

Both predictors are linearly associated with y because x1 and x2 are correlated with a third latent variable x0, which generates y "behind the scenes". The correlation between x2 and x0 is weaker than the correlation between x1 and x0, so once we include x1 in the regression, there is no systematic variability left for x2 to explain.

Lesson learned: Association is not causation. Even the simplest of simulations illustrates that regression tells us only about the associations between the known predictors and the response, not about the true data generating process.
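A quick way to confirm this in the simulation (again, not part of the original output) is to check how strongly the two predictors are correlated, and then test whether x2 adds anything on top of x1 with a nested-model comparison:

# x1 and x2 are near-duplicates because both track the latent x0,
# so their sample correlation is close to 1 by construction
cor(sim$x1, sim$x2)

# nested-model F test: does adding x2 to y ~ x1 improve the fit?
anova(lm(y ~ x1, data = sim), lm(y ~ x1 + x2, data = sim))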

[Figure: side-by-side plots, "Predicted values of y given x2" and "Predicted values of y given x1 and x2"]

anova(lm(y ~ x2, data = sim))
#> Analysis of Variance Table
#> 
#> Response: y
#>           Df Sum Sq Mean Sq F value    Pr(>F)    
#> x2         1 124.27 124.273  106.47 < 2.2e-16 ***
#> Residuals 98 114.39   1.167

anova(lm(y ~ x1 + x2, data = sim))
#> Analysis of Variance Table
#> 
#> Response: y
#>           Df  Sum Sq Mean Sq  F value Pr(>F)    
#> x1         1 133.899 133.899 123.9919 <2e-16 ***
#> x2         1   0.012   0.012   0.0116 0.9146    
#> Residuals 97 104.750   1.080

The R code in all its gory details:

library("broom")
library("ggeffects")
library("tidyverse")
library("patchwork")

Example #1

set.seed(1234)

n <- 100

# x1 dominates y; x2 has only a small effect on top of the noise
sim <- tibble(
  x1 = rnorm(n, sd = 1),
  x2 = rnorm(n, sd = 2),
  noise = rnorm(n),
  y = 5 * x1 + 0.25 * x2 + noise
)

m.2 <- lm(y ~ x2, data = sim)
m12 <- lm(y ~ x1 + x2, data = sim)

anova(m.2)
anova(m12)

p.2 <- ggpredict(m.2, terms = c("x2")) %>% plot()
p.2 <- p.2 + ggtitle("Predicted values of y given x2")
p12 <- ggpredict(m12, terms = c("x2", "x1 [-2,0,2]")) %>% plot()
p12 <- p12 + ggtitle("Predicted values of y given x1 and x2")

p.2 + p12

Example #2

set.seed(1234)

n <- 100

# x1 and x2 are noisy copies of the latent x0, which alone generates y
sim <- tibble(
  x0 = rnorm(n, sd = 1),
  x1 = x0 + rnorm(n, sd = 0.1),
  x2 = x0 + rnorm(n, sd = 0.3),
  noise = rnorm(n),
  y = x0 + noise
)

m.2 <- lm(y ~ x2, data = sim)
m12 <- lm(y ~ x1 + x2, data = sim)

anova(m.2)
anova(m12)

p.2 <- ggpredict(m.2, terms = c("x2")) %>% plot()
p.2 <- p.2 + ggtitle("Predicted values of y given x2")
p12 <- ggpredict(m12, terms = c("x2", "x1 [-2,0,2]")) %>% plot()
p12 <- p12 + ggtitle("Predicted values of y given x1 and x2")

p.2 + p12
