
I am using a tool in genetics that works very similarly to stepwise linear regression (it's called GCTA-COJO, for those interested). Essentially, the starting situation looks like this:

You have thousands of predictors (called SNP1, SNP2, ...) and a continuous trait Y you want to predict. In the first step, you fit a linear model that includes only one of the predictors at a time (i.e. lm(Y~SNP1), lm(Y~SNP2), ...).

In the second step, you fit a joint model that includes multiple predictors (i.e. lm(Y~SNP1+SNP5+SNP159)). Here I noticed that predictors which are not significantly associated with the outcome Y in the "single predictor" association test (i.e. lm(Y~SNP1)) often become significant when they are included in the joint model (lm(Y~SNP1+SNP5+SNP159)). I have a very hard time understanding how such a thing is possible, and would love it if somebody could give me an intuitive explanation of why this happens.

Here is some example data, which describes my situation reasonably well:

library(broom)
library(tidyverse)

set.seed(123) # Set seed for reproducibility

# Generate 100 observations for each variable
n <- 100

# Generate predictors with varying correlations
predictors <- matrix(rnorm(20 * n), nrow = n)
predictors[, 1:5] <- predictors[, 1] + rnorm(n, sd = 0.2)  # Introduce some correlation
predictors[, 6:10] <- predictors[, 5] + rnorm(n, sd = 0.3) # Introduce more correlation

# Generate the outcome variable with varying relationships to predictors
Y <- 2 + 0.5 * predictors[, 1] + 0.8 * predictors[, 5] + 0.3 * predictors[, 10] + rnorm(n)

# Combine into a data frame
data <- data.frame(Y, predictors)
names(data)[-1] <- paste0("X", 1:20) # Assign column names

# Store the names of the 20 predictors
predictors <- names(data)[-1] # Assuming Y is the first column

# Fit individual linear regressions for each predictor
for (predictor in predictors) {
  model <- lm(formula = Y ~ data[[predictor]], data = data)
  print(predictor)
  print(tidy(model))
}

# Full model with all predictors
full_model <- lm(Y ~ ., data = data)

# Stepwise selection using AIC
best_model <- step(full_model, direction = "both") # Consider both forward and backward selection

# Print the summary of the selected model
summary(best_model)

In this example, predictor X13 is not significant in the single-variant association test, but in the joint model (best_model) we can see that it becomes significantly associated. This is surprising to me, given that it does not look like there is any association between the two variables:

[scatter plot of Y against X13]

Any insights into why this is possible are very much appreciated!

Cheers!

nickhir

2 Answers


First, your full model has huge collinearity problems:

library(olsrr)
ols_coll_diag(full_model)

reveals that the highest condition indexes are NaN, but the 17th is on the order of $10^{19}$.

Second, stepwise is a very problematic method of variable selection. Basically, all the output is wrong: p-values are too low, standard errors are too small, and parameter estimates are biased away from zero. This has been covered here many times.
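As a rough illustration of how badly selection distorts inference (a toy simulation, entirely separate from your data): even when Y is unrelated to every predictor, stepwise AIC selection will typically return a model in which some retained predictors look nominally significant:

set.seed(1)
noise <- as.data.frame(matrix(rnorm(100 * 20), ncol = 20))  # 20 pure-noise predictors
noise$Y <- rnorm(100)                                       # outcome unrelated to all of them
picked <- step(lm(Y ~ ., data = noise), direction = "both", trace = 0)
summary(picked)  # the retained predictors frequently show p < 0.05 despite no true effect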

Finally, to answer your question (with all that said): the parameter estimates in a regression are estimated after controlling for the other variables that are in the model. When two of those variables are correlated, each estimate reflects the effect after the other variable's effect has been removed.

To give a made-up example, but with real variables: suppose your independent variables include the lengths of various body parts and you are doing logistic regression for some particular condition. If that condition involves an abnormal length of the legs compared to the torso, but does not involve overall size, then it is perfectly reasonable for leg length to be significant when torso length is included, but not when it is left out.
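A quick linear-model analog of that leg/torso story (numbers and variable names are invented purely for illustration; linear rather than logistic to match the question's setting):

set.seed(42)
n     <- 100
torso <- rnorm(n)
leg   <- 0.9 * torso + sqrt(1 - 0.9^2) * rnorm(n)  # leg length strongly tracks torso length
y     <- (leg - torso) + rnorm(n)                  # outcome depends on leg length *relative to* torso

summary(lm(y ~ leg))          # marginal slope is tiny and will usually not be significant
summary(lm(y ~ leg + torso))  # controlling for torso, the leg coefficient is near 1 and clearly significant

Marginally, leg is almost a proxy for torso, so its association with y is diluted; once torso is held fixed, the remaining variation in leg is exactly what predicts y.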

Peter Flom
    +1. Here is our canonical thread on why stepwise variable selection is usually a bad idea: https://stats.stackexchange.com/q/20836/1352 – Stephan Kolassa Dec 23 '23 at 16:36

A key phrase to think about is "after controlling for". E.g. maybe there is one variable that drives most of the variation; only after taking that effect out can you see the effects of the other variables.

So the question is how the other variables covary with X13 and the dependent variable. E.g. perhaps most of the variation in Y is due to variable X12; once you have got rid of that variation, you might see a cleaner relationship between Y and X13.

You can actually perform multiple regression in a stepwise manner:

1. Calculate the residuals of Y regressed on $X_1$; call them $r_{Y,X_1}$.

2. Calculate the residuals of $X_2$ regressed on $X_1$; call them $r_{X_2,X_1}$.

3. Now regress the residuals $r_{Y,X_1}$ on $r_{X_2,X_1}$ (and, similarly, make a scatter plot of one against the other).
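In R the two steps look roughly like this (a self-contained sketch with simulated y, x1 and x2, not the question's data):

set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)          # x2 is correlated with x1
y  <- x1 + x2 + rnorm(n)

r_y_x1  <- residuals(lm(y ~ x1))       # part of y not explained by x1
r_x2_x1 <- residuals(lm(x2 ~ x1))      # part of x2 not explained by x1

coef(lm(r_y_x1 ~ r_x2_x1))["r_x2_x1"]  # slope of the residual-on-residual regression ...
coef(lm(y ~ x1 + x2))["x2"]            # ... equals the x2 coefficient in the joint model
plot(r_x2_x1, r_y_x1)                  # the "adjusted" scatter plot

The residual-on-residual slope reproduces the joint-model coefficient exactly; this is the sense in which the joint model shows the relationship between Y and $X_2$ "after controlling for" $X_1$.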

seanv507
  • Thank you very much for your answer! I was wondering if you could explain the procedure of the "stepwise multiple regression" a bit more. I thought it would rather look something like this:

    x1_residuals <- residuals(lm(formula = Y ~ data[["X1"]], data = data))

    summary(lm(formula = x1_residuals ~ data[["X13"]]))

    However, in this regression I get a p-value of 0.078 instead of 0.037 (which I get for the multiple regression).

    – nickhir Dec 26 '23 at 17:58
  • It's section 3.2.3 of The Elements of Statistical Learning, which is freely available online: https://hastie.su.domains/Papers/ESLII.pdf – seanv507 Dec 26 '23 at 23:23
  • The second regression should be using the residual of x13 on x1 as independent variable – seanv507 Dec 26 '23 at 23:25
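In code, seanv507's correction to the snippet in the first comment would look something like this (using the question's data; the slope then matches the X13 coefficient of lm(Y ~ X1 + X13, data = data), although the p-value will still differ from best_model's, which adjusts for more variables than just X1):

y_res   <- residuals(lm(Y ~ X1, data = data))    # residualize Y on X1
x13_res <- residuals(lm(X13 ~ X1, data = data))  # residualize X13 on X1 as well
summary(lm(y_res ~ x13_res))                     # regress residual on residual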