
Although the merits of stepwise model selection have been discussed previously, it has become unclear to me what exactly "stepwise model selection" or "stepwise regression" is. I thought I understood it, but I'm not so sure anymore.

My understanding is that these two terms are synonymous (at least in a regression context), and that they refer to the selection of the best set of predictor variables in an "optimal" or "best" model, given the data. (You can find the Wikipedia page here, and another potentially useful overview here.)
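To pin down what I mean by the procedure itself, here is a rough sketch of forward stepwise selection using AIC as the criterion (Python with statsmodels; the data and variable names are made up purely for illustration). Replacing the AIC comparison with a p-value threshold would give the hypothesis-testing flavour, and the same greedy loop can be run backwards by starting from the full model and dropping terms:

```python
# A minimal sketch of forward stepwise selection by AIC.
# Hypothetical data; statsmodels is assumed for the OLS fits.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                  # 5 candidate predictors
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

def fit_aic(cols):
    """AIC of an OLS model with an intercept plus the given columns."""
    design = sm.add_constant(X[:, cols]) if cols else np.ones((n, 1))
    return sm.OLS(y, design).fit().aic

selected, remaining = [], list(range(X.shape[1]))
current_aic = fit_aic(selected)
while remaining:
    # Try adding each remaining predictor; keep the best improvement.
    best_aic, best_j = min((fit_aic(selected + [j]), j) for j in remaining)
    if best_aic >= current_aic:
        break                                # no candidate improves AIC
    selected.append(best_j)
    remaining.remove(best_j)
    current_aic = best_aic

print("selected predictors:", selected, "AIC:", round(current_aic, 1))
```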

Based on several previous threads (for example here: Algorithms for automatic model selection), it appears that stepwise model selection is considered a cardinal sin. And yet it seems to be used all the time, including by what appear to be well-respected statisticians. Or am I mixing up the terminology?

My main questions are:

  1. By "stepwise model selection" or "stepwise regression", do we mean:
    A) doing sequential hypothesis testing, such as likelihood ratio tests or looking at p-values? (There is a related post here: Why are p-values misleading after performing a stepwise selection?) Is this what is meant by the term, and is that why it is bad?
    Or
    B) do we also consider selection based on AIC (or a similar information criterion) to be equally bad? From the answer at Algorithms for automatic model selection, it appears that this too is criticized. On the other hand, Whittingham et al. (2006; pdf)1 seem to suggest that variable selection based on an information-theoretic (IT) approach is different from stepwise selection (and is a valid approach)...?

    And this is the source of all my confusion.

    To follow up: if AIC-based selection does fall under "stepwise" and is considered inappropriate, then here are some additional questions:

  2. If this approach is wrong, why is it taught in textbooks, university courses, etc.? Is all that plain wrong?

  3. What are good alternatives for selecting which variables should remain in the model? I have come across recommendations to use cross-validation with training/test datasets, and the LASSO (see the first sketch after this list).

  4. I think everyone can agree that indiscriminately throwing all possible variables into a model and then doing stepwise selection is problematic. Of course, some sane judgement should guide what goes in initially. But what if we already start with a limited number of possible predictor variables based on some (say, biological) knowledge, and all of these predictors may well explain our response? Would this approach to model selection still be flawed? I also acknowledge that selecting the "best" model might not be appropriate if AIC values among different models are very similar (and multi-model inference may be applied in such cases; see the second sketch after this list). But is the underlying issue with AIC-based stepwise selection still problematic?

    If we are looking to see which variables seem to explain the response and in what way, why is this approach wrong, since we know "all models are wrong, but some are useful"?
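For question 3, this is the kind of alternative I have in mind — a minimal sketch assuming scikit-learn, again with made-up data. LassoCV chooses the penalty strength by cross-validation, and predictors whose coefficients are shrunk to zero are effectively dropped, so selection and fitting happen in one step rather than via a greedy search:

```python
# A minimal sketch of LASSO with cross-validated penalty selection.
# Hypothetical data; scikit-learn is assumed.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 5))                  # 5 candidate predictors
y = 1.5 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=n)

# 5-fold cross-validation over a grid of penalty strengths (alpha).
lasso = LassoCV(cv=5).fit(X, y)
print("chosen alpha:", round(lasso.alpha_, 3))
print("coefficients:", np.round(lasso.coef_, 2))  # zeros = excluded
```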
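And for the multi-model inference point in question 4, here is a minimal sketch of what I understand that to mean: compare a small, theory-driven candidate set by AIC and convert the AIC differences into Akaike weights, rather than searching for a single "best" model (hypothetical data and models; statsmodels assumed):

```python
# A minimal sketch of multi-model comparison via AIC and Akaike weights.
# Hypothetical data and candidate models; statsmodels is assumed.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 * x1 + rng.normal(size=n)

candidates = {                               # pre-specified candidate set
    "x1":      np.column_stack([x1]),
    "x2":      np.column_stack([x2]),
    "x1 + x2": np.column_stack([x1, x2]),
}
aics = {name: sm.OLS(y, sm.add_constant(Z)).fit().aic
        for name, Z in candidates.items()}

# Akaike weights: relative support for each model within the set.
best = min(aics.values())
raw = {m: np.exp(-0.5 * (a - best)) for m, a in aics.items()}
weights = {m: r / sum(raw.values()) for m, r in raw.items()}
for m in candidates:
    print(f"{m}: AIC={aics[m]:.1f}, weight={weights[m]:.2f}")
```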

1. Whittingham, M.J., Stephens, P.A., Bradbury, R.B., & Freckleton, R.P. (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75, pp. 1182–1189.

Tilen
  • Both AIC and p-values are misleading when used with stepwise regression! You can find an intuitive explanation, with an example of stepwise regression using AIC, here: https://metariat.wordpress.com/2016/12/19/how-bad-is-stepwise-regression/ – Metariat Mar 09 '17 at 14:27
  • Could you clarify what exactly is unclear to you in the Algorithms for automatic model selection thread you refer to...? It seems to answer all of your questions, giving a pretty detailed answer. To answer the basic question: stepwise model selection means taking a regression with a number of predictors and then dropping (or adding) one at a time, based on some criterion of model improvement, until the "best" model is found. – Tim Mar 09 '17 at 14:40
  • @Tim, apologies for the delayed response. Well, no, I don't think it answers all of my questions, and several issues remain unclear (to me). 1) I wanted to clarify the terminology: various sources use different terms, so I wanted to understand thoroughly whether the terms I'm referring to are synonyms or not. 2) While I could gather from that thread that the problems are the same regardless of the criterion used, the literature is inconsistent on this. 3) When reading papers and books, there seems to be disagreement over what is appropriate and what isn't (or when). – Tilen Mar 13 '17 at 17:38
  • 4) One of my questions was also why this is still being taught (by apparently knowledgeable names) if it is considered wrong. I wanted to understand whether this is a thing of the past (which it does not seem to be, given the publication dates of certain books), a matter of different schools of thought, or simply plain ignorance. 5) I wanted to understand whether this approach is wrong even when the starting set of candidate predictor variables is already limited. In other words, my personal interest is in finding the best set of predictors, given an already reduced and well-thought-out set. – Tilen Mar 13 '17 at 17:42
  • Bottom line: even though the thread on Algorithms for automatic model selection was very informative and useful, it still left me with loads of questions and confusion. – Tilen Mar 13 '17 at 17:43