4

I read an unpublished paper that fits a regression model with about 20 covariates. The authors use a stepwise variable selection method and arrive at a model with two covariates, both with small p-values.

The question is: is it valid to settle on a model with only 2 covariates out of 20? Even when none of the covariates has any true effect, we can often find 2 covariates that both have p < 0.05 in a multivariable model, purely because of Type I errors.

What is a reasonable number of covariates to select out of 20?

Some related questions were raised in https://stats.stackexchange.com/a/20856

Dave
Viktor

1 Answer

7

If the paper calculates the p-values as if they had chosen the stepwise-selected model from the beginning, their methodology is poor and warrants pushback for the reasons given in the linked answer (especially points 2, 3, 4, and 7). This pushback could, but need not, include a recommendation of rejection. Be prepared for the authors to contest this pushback with something like:

"This is standard practice in our field!"

The gist of points 2, 3, 4, and 7 of the linked answer is that the p-values and confidence intervals are calculated under particular assumptions, and those assumptions do not include conditioning on the stepwise variable selection procedure. Consequently, the t-statistics do not have their claimed t-distributions, and the p-values calculated from them are not what they should be.
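To make this concrete, here is a rough simulation sketch of my own (it is not from the paper or from the linked answer). It assumes 20 independent standard normal covariates, a response that is pure noise, and a simple forward selection rule with an entry threshold of 0.05. The function names, sample size, and number of replications are illustrative choices, and the p-values use a normal approximation to the t-distribution, which should be close at n = 200. Under independence, the chance that at least one of 20 null covariates shows p < 0.05 is roughly 1 - 0.95^20 ≈ 0.64, so the selection step should "find" something in well over half of the runs.

```python
import math
import numpy as np


def ols_pvalues(design, y):
    """Naive two-sided p-values for OLS coefficients.

    Assumption: normal approximation to the t-distribution (fine for large n).
    """
    n, k = design.shape
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(design.T @ design)
    t_stats = beta / np.sqrt(np.diag(cov))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])


def forward_select(X, y, alpha=0.05):
    """Forward selection: keep adding the covariate with the smallest
    naive p-value until no candidate falls below alpha."""
    n, p = X.shape
    selected = []
    while True:
        best_j, best_p = None, alpha
        for j in range(p):
            if j in selected:
                continue
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            p_new = ols_pvalues(design, y)[-1]  # p-value of the candidate
            if p_new < best_p:
                best_j, best_p = j, p_new
        if best_j is None:
            return selected
        selected.append(best_j)


def simulate(n_sims=200, n=200, p=20, seed=0):
    """Count how many pure-noise covariates survive selection in each run."""
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(n_sims):
        X = rng.standard_normal((n, p))
        y = rng.standard_normal(n)  # y is unrelated to every covariate
        counts.append(len(forward_select(X, y)))
    return np.array(counts)


counts = simulate()
frac_any = float(np.mean(counts >= 1))
frac_two = float(np.mean(counts >= 2))
print(f"runs selecting at least one null covariate:  {frac_any:.2f}")
print(f"runs selecting at least two null covariates: {frac_two:.2f}")
```

Every covariate that survives here has a naive p-value below 0.05 by construction, even though none of them is related to the response. That is the sense in which the reported p-values ignore the selection step.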

Dave
  • 2
    In addition, variable selection doesn't work at what it mainly aims to do. It doesn't select the right variables. The data have insufficient information for doing that. And selection is further ruined by collinearities. – Frank Harrell Nov 27 '22 at 12:23