
Although the merits of stepwise model selection have been discussed previously, it has become unclear to me what exactly "stepwise model selection" or "stepwise regression" is. I thought I understood it, but I'm not so sure anymore.

My understanding is that these two terms are synonymous (at least in a regression context), and that they refer to the selection of the best set of predictor variables in an "optimal" or "best" model, given the data. (You can find the Wikipedia page here, and another potentially useful overview here.)
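To pin down what I mean by the procedure itself, here is a rough sketch of forward stepwise selection using AIC as the criterion (Python with statsmodels; the data and variable names are made up purely for illustration). Replacing the AIC comparison with a p-value threshold would give the hypothesis-testing flavour, and the same greedy loop can be run backwards by starting from the full model and dropping terms:

```python
# A minimal sketch of forward stepwise selection by AIC.
# Hypothetical data; statsmodels is assumed for the OLS fits.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                  # 5 candidate predictors
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

def fit_aic(cols):
    """AIC of an OLS model with an intercept plus the given columns."""
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return sm.OLS(y, design).fit().aic

selected, remaining = [], list(range(X.shape[1]))
current_aic = fit_aic(selected)
while remaining:
    # Try adding each remaining predictor; keep the best improvement.
    best_aic, best_j = min((fit_aic(selected + [j]), j) for j in remaining)
    if best_aic >= current_aic:
        break                                # no candidate improves AIC
    selected.append(best_j)
    remaining.remove(best_j)
    current_aic = best_aic

print("selected predictors:", selected, "AIC:", round(current_aic, 1))
```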

Based on several previous threads (for example here: Algorithms for automatic model selection), it appears that stepwise model selection is considered a cardinal sin. And yet it seems to be used all the time, including by what appear to be well-respected statisticians. Or am I mixing up the terminology?

My main questions are:

  1. By "stepwise model selection" or "stepwise regression", do we mean:
    A) doing sequential hypothesis testing, such as likelihood ratio tests or looking at p-values? (There is a related post here: Why are p-values misleading after performing a stepwise selection?) Is this what is meant by the term, and is that why it is bad?
    Or
    B) do we also consider selection based on AIC (or a similar information criterion) to be equally bad? From the answer at Algorithms for automatic model selection, it appears that this too is criticized. On the other hand, Whittingham et al. (2006; pdf)1 seem to suggest that variable selection based on an information-theoretic (IT) approach is different from stepwise selection (and is a valid approach)...?

    And this is the source of all my confusion.

    To follow up: if AIC-based selection does fall under "stepwise" and is considered inappropriate, then here are some additional questions:

  2. If this approach is wrong, why is it taught in textbooks, university courses, etc.? Is all that plain wrong?

  3. What are good alternatives for selecting which variables should remain in the model? I have come across recommendations to use cross-validation with training/test datasets, and the LASSO (see the first sketch after this list).

  4. I think everyone can agree that indiscriminately throwing all possible variables into a model and then doing stepwise selection is problematic. Of course, some sane judgement should guide what goes in initially. But what if we already start with a limited number of possible predictor variables based on some (say, biological) knowledge, and all of these predictors may well explain our response? Would this approach to model selection still be flawed? I also acknowledge that selecting the "best" model might not be appropriate if AIC values among different models are very similar (and multi-model inference may be applied in such cases; see the second sketch after this list). But is the underlying issue with AIC-based stepwise selection still problematic?

    If we are looking to see which variables seem to explain the response and in what way, why is this approach wrong, since we know "all models are wrong, but some are useful"?
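For question 3, this is the kind of alternative I have in mind — a minimal sketch assuming scikit-learn, again with made-up data. LassoCV chooses the penalty strength by cross-validation, and predictors whose coefficients are shrunk to zero are effectively dropped, so selection and fitting happen in one step rather than via a greedy search:

```python
# A minimal sketch of LASSO with cross-validated penalty selection.
# Hypothetical data; scikit-learn is assumed.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 5))                  # 5 candidate predictors
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

# 5-fold cross-validation over a grid of penalty strengths (alpha).
lasso = LassoCV(cv=5).fit(X, y)
print("chosen alpha:", round(lasso.alpha_, 3))
print("coefficients:", np.round(lasso.coef_, 2))  # zeros = excluded
```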
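And for the multi-model inference point in question 4, here is a minimal sketch of what I understand that to mean: compare a small, theory-driven candidate set by AIC and convert the AIC differences into Akaike weights, rather than searching for a single "best" model (hypothetical data and models; statsmodels assumed):

```python
# A minimal sketch of multi-model comparison via AIC and Akaike weights.
# Hypothetical data and candidate models; statsmodels is assumed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 * x1 + rng.normal(size=n)

candidates = {                               # pre-specified candidate set
    "x1":      np.column_stack([x1]),
    "x2":      np.column_stack([x2]),
    "x1 + x2": np.column_stack([x1, x2]),
}
aics = {name: sm.OLS(y, sm.add_constant(Z)).fit().aic
        for name, Z in candidates.items()}

# Akaike weights: relative support for each model within the set.
best = min(aics.values())
raw = {m: np.exp(-0.5 * (a - best)) for m, a in aics.items()}
weights = {m: r / sum(raw.values()) for m, r in raw.items()}
for m in candidates:
    print(f"{m}: AIC={aics[m]:.1f}, weight={weights[m]:.2f}")
```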

1. Whittingham, M.J., Stephens, P.A., Bradbury, R.B., & Freckleton, R.P. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75, pp. 1182–1189.

Tilen
  • Both AIC and p-values are misleading when used with stepwise regression! You can find an intuitive explanation, with an example of stepwise regression using AIC, here: https://metariat.wordpress.com/2016/12/19/how-bad-is-stepwise-regression/ – Metariat Mar 09 '17 at 14:27
  • Could you clarify what exactly is unclear to you in the Algorithms for automatic model selection thread you refer to...? It seems to answer all of your questions, giving a pretty detailed answer. To answer the basic question: stepwise model selection means taking a regression with a number of predictors and then dropping (or adding) one at a time, based on some criterion of model improvement, until the "best" model is found. – Tim Mar 09 '17 at 14:40
  • @Tim, apologies for the delayed response. Well, no, I don't think it answers all of my questions, and several issues remain unclear (to me). 1) I wanted to clarify the terminology: various sources use different terms, so I wanted to understand thoroughly whether the terms I'm referring to are synonyms or not. 2) While I could gather from that thread that the problems are the same regardless of the criterion used, the literature is inconsistent on this. 3) When reading papers and books, there seems to be disagreement over what is appropriate and what isn't (or when). – Tilen Mar 13 '17 at 17:38
  • 4) One of my questions was also why this is still being taught (by apparently knowledgeable names) if it is considered wrong. I wanted to understand whether this is a thing of the past (which it does not seem to be, given the publication dates of certain books), a matter of different schools of thought, or simply plain ignorance. 5) I wanted to understand whether this approach is wrong even when the starting set of candidate predictor variables is already limited. In other words, my personal interest is in finding the best set of predictors, given an already reduced and well-thought-out set. – Tilen Mar 13 '17 at 17:42
  • Bottom line: even though the thread on Algorithms for automatic model selection was very informative and useful, it still left me with loads of questions and confusion. – Tilen Mar 13 '17 at 17:43