
When we want to do inference on parameters or nested models, stepwise variable selection causes a number of problems, discussed by Frank Harrell and others.

However, if we validate the stepwise model-building procedure on some kind of holdout data, we get an honest assessment of how our modeling performs — something we do not get from an adjusted $R^2$ computed only on the selected variables.

How competitive is stepwise regression in this setting of pure prediction compared to alternatives that deal with large numbers of candidate variables (e.g., ridge or LASSO regression)? Simulations and published results (simulated, empirical, or theoretical/theorems) would be appreciated!

While I can believe that a stepwise procedure would underperform its competitors, why that is the case is less obvious than why the usual adjusted $R^2$ is an inappropriate measure of performance.

Dave
  • There are a few issues related to prediction from regression models ... but I'm not sure they're quite what you're asking. One issue: sometimes even predicting the 'true' model (one you used to generate data) can be worse in MSPE terms than a regularized one; that sort of effect tends to be worse with stepwise selection. Secondly, if you want to predict, you will need to estimate parameters on a different subset than you used for variable selection, or you'll have the usual bias in estimates and standard errors (indeed in variance-covariance matrices) which impacts the properties of prediction. – Glen_b Oct 31 '22 at 01:06
  • @Glen_b Do you mean that one data set results in the true model not being the best predictor or that this can happen regularly (in expected value or something)? I can believe the former (we happen upon all kinds of bizarre behavior at least once in a while), but the latter would be strange behavior to say the least! – Dave Nov 08 '22 at 14:51
  • Not specific to particular data sets; for some models and population parameter values you could generate many data sets and have it happen commonly and definitely "on average". Consider, just for an example, a model that's nearly linear in a predictor that's (say) closely correlated with time (or for simplicity imagine it's $t$), but the model actually (in the population) has a smallish cubic-like term. Now ... 1. Imagine there's just about enough data to typically pick up that it's not linear; further imagine that we do consider a cubic model. ... – Glen_b Nov 08 '22 at 15:45
  • ... Now predict for some number of periods into the future (it needn't be far). The uncertainty in the estimate of that nonlinear term is going to be large so the predictions are extremely volatile (high MSPE), but at least it's unbiased. You'll get a much smaller MSPE if you shrink coefficients (trading variance for a little bias). 2. Now introduce stepwise selection (or indeed other forms of variable selection); on average across models where the term is included, its coefficient is biased away from zero, instead of shrunk toward it. Worse MSPE – Glen_b Nov 08 '22 at 15:45
  • NB that's an effect over data sets generated from a given model, not conditional on the specific data sets. There's nothing particular to the choices in the example either; a wide variety of small effects that are (a) not accurately estimated but that we can typically tell aren't zero and (b) that can have larger effects when predicting than they typically do in the data at hand will produce really strong effects of this sort (I've seen MSPEs well over 50 times as big for the model that generated the data than with good choices of shrinkage). – Glen_b Nov 08 '22 at 15:54
  • You don't actually need (b) to hold to see it happen -- (a) is often sufficient -- but (b) tends to make it more dramatic, so it hits you between the eyes. It's decades since I played with examples of this sort but hopefully the gist is sufficiently clear to see what's going on and why it will happen. – Glen_b Nov 08 '22 at 15:55
  • Related: https://stats.stackexchange.com/questions/89202. A highly upvoted answer suggests that stepwise may be better or worse than LASSO. – Richard Hardy Nov 10 '22 at 16:16
  • The formulation of the question is quite loaded, i.e. it presumes stepwise regression is bad at prediction. The formulation might surely be a matter of taste, but I would consider going for a more neutral one instead. – Richard Hardy Nov 10 '22 at 16:36
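Glen_b's cubic-term scenario in the comments can be sketched as a small simulation. The version below uses assumed, illustrative values (not from the discussion): $n = 30$, noise $\sigma = 0.5$, a smallish cubic coefficient of 0.3, and a fixed ridge penalty $\lambda = 0.5$ that leaves the intercept unpenalized. It fits the correct cubic model by OLS and by ridge over many generated data sets and compares MSPE a few periods past the observed range; how large the gap is (and in whose favor) depends on these settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 30, 2000
t_train = np.linspace(0.0, 1.0, n)
t_test = np.array([1.1, 1.2, 1.3])  # predict a few periods past the data

def design(t):
    """Cubic polynomial design matrix: [1, t, t^2, t^3]."""
    return np.column_stack([np.ones_like(t), t, t**2, t**3])

beta = np.array([0.0, 1.0, 0.0, 0.3])  # nearly linear, smallish cubic term
X, X_new = design(t_train), design(t_test)
mu_new = X_new @ beta                  # true future means

lam = 0.5                              # assumed ridge penalty
penalty = np.diag([0.0, 1.0, 1.0, 1.0])  # do not shrink the intercept

mspe_ols = mspe_ridge = 0.0
for _ in range(n_sims):
    y = X @ beta + rng.normal(0.0, 0.5, n)
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    b_ridge = np.linalg.solve(X.T @ X + lam * penalty, X.T @ y)
    # MSPE against the true future mean, averaged over data sets
    mspe_ols += np.mean((X_new @ b_ols - mu_new) ** 2) / n_sims
    mspe_ridge += np.mean((X_new @ b_ridge - mu_new) ** 2) / n_sims

print(f"OLS MSPE: {mspe_ols:.3f}, ridge MSPE: {mspe_ridge:.3f}")
```

The key mechanism is the one Glen_b describes: the extrapolated prediction is very sensitive to the poorly estimated cubic coefficient, so OLS on the *true* model pays a large variance price, while shrinkage trades some bias for a big variance reduction.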

1 Answer


Stepwise regression is not generally bad at prediction, in the sense that it is not generally worse than, say, LASSO or best subset selection. Which is to say it may be quite good! Recent evidence can be found in Hastie et al. (2020). (I regard the authors as pretty much the ultimate experts in the field.) Noting that LASSO does not universally dominate ridge, nor vice versa (see e.g. Tibshirani (1996)), I conjecture that stepwise regression may do better or worse than ridge, too.
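To illustrate that the comparison is an empirical question, here is a minimal simulation sketch (not from the cited papers) pitting forward stepwise selection against cross-validated LASSO on a synthetic sparse problem. Fixing the number of selected features at the (here known) truth of 5 is an illustrative shortcut; in practice that number would itself be chosen by cross-validation. Which method wins can flip with the signal-to-noise ratio, sparsity, and correlation structure.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic sparse problem: 50 candidate predictors, 5 truly active.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Forward stepwise: greedily add features by CV score, then refit OLS
# on the selected subset. n_features_to_select=5 uses oracle knowledge,
# purely for illustration.
stepwise = make_pipeline(
    SequentialFeatureSelector(LinearRegression(), n_features_to_select=5,
                              direction="forward"),
    LinearRegression(),
)
stepwise.fit(X_tr, y_tr)
mse_step = np.mean((stepwise.predict(X_te) - y_te) ** 2)

# LASSO with penalty chosen by 5-fold cross-validation.
lasso = LassoCV(cv=5, random_state=0).fit(X_tr, y_tr)
mse_lasso = np.mean((lasso.predict(X_te) - y_te) ** 2)

print(f"stepwise test MSE: {mse_step:.1f}, LASSO test MSE: {mse_lasso:.1f}")
```

Rerunning with different `random_state`, noise levels, or correlated predictors is a quick way to see that neither method dominates, consistent with the point above.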

References:

  • Hastie, T., Tibshirani, R., & Tibshirani, R. (2020). Best Subset, Forward Stepwise or Lasso? Analysis and Recommendations Based on Extensive Comparisons. Statistical Science, 35(4), 579–592.
  • Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B, 58(1), 267–288.

Richard Hardy
  • I think I have read several solid papers where stepwise regression performed comparably to LASSO, but I cannot quite remember the titles. – Richard Hardy Nov 10 '22 at 16:42