
I am trying to prove to myself that stepwise selection should not be used; we often model data this way at my work. I recently bought Frank Harrell's very interesting book, *Regression Modeling Strategies*. In Section 4.3, "Variable selection", he states the following:

> But using $Y$ to compute $P$-values to decide which variables to include is similar to using $Y$ to decide how to pool treatments in a five-treatment randomized trial, and then testing for global treatment differences using fewer than four degrees of freedom.

He gave a similar explanation in a post here on CrossValidated, but I do not understand either part of the example (pooling the treatments, then testing for global differences).

I understand that there is a multiple-testing problem, but I would like a more technical proof, or more details on these examples.

Stefan
Fed

1 Answer


For what it may be worth, here is my attempt at an explanation.

One reason stepwise selection is a bad procedure is that at each step the model is fitted by ordinary least squares, i.e. without constraints. If you are planning to do feature selection, it usually means you are in a scenario where $p \gg n$. To find $\hat{\beta}$, OLS must invert the matrix $X^T X$, which is not invertible in that case. So you should prefer a method like the lasso, which is constrained OLS.
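A quick numpy sketch of the rank problem (the dimensions are arbitrary; I show ridge-style regularization as the analogue, since the lasso has no closed form but repairs the problem in the same spirit):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                      # p >> n: more predictors than observations
X = rng.standard_normal((n, p))

XtX = X.T @ X                      # the p x p Gram matrix OLS must invert
rank = np.linalg.matrix_rank(XtX)
print(rank)                        # rank is at most n = 20 < p: singular

# Adding a penalty makes the system invertible: XtX + lam*I has
# full rank for any lam > 0, so a unique solution exists.
lam = 1.0
rank_reg = np.linalg.matrix_rank(XtX + lam * np.eye(p))
print(rank_reg)                    # full rank p = 50
```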

Second reason: the stepwise procedure is suboptimal by construction. Each variable is selected greedily, so the algorithm simply cannot know whether it has found a global optimum or only a local one.
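A small simulation of this (synthetic data of my own construction): `x3` is a noisy proxy for `x1 + x2`, so forward selection greedily commits to `x3` at the first step and then cannot reach the best two-variable model, which exhaustive search finds:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n = 200
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
# x3 is a noisy proxy for x1 + x2: alone it predicts y best,
# but the pair {x1, x2} is the truly optimal two-variable model.
x3 = (x1 + x2) / np.sqrt(2) + 0.3 * rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + 0.1 * rng.standard_normal(n)

def rss(cols):
    """Residual sum of squares of the OLS fit on the given columns."""
    Xs = X[:, list(cols)]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    r = y - Xs @ beta
    return r @ r

# Greedy step 1: x3 (index 2) wins, since alone it tracks x1 + x2 best.
first = min(range(3), key=lambda j: rss([j]))
# Greedy step 2: forward selection is now stuck with x3 in the model.
second = min((j for j in range(3) if j != first),
             key=lambda j: rss([first, j]))
greedy = {first, second}

# Exhaustive search over all two-variable models finds {x1, x2}.
best = min(combinations(range(3), 2), key=rss)
print(first, greedy, set(best))
```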

I'd add that there's a general problem with feature selection: people forget that if you use your data twice, first to perform feature selection and then to carry out inference, you introduce a substantial bias into your estimates. Read this: http://www.maths.bath.ac.uk/~jjf23/papers/interface98.pdf
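A small simulation of that bias (my own illustration, not from the paper): the predictors and response are pure noise, yet if we select the predictor most correlated with $Y$ and then test it as if it had been chosen in advance, we "find" a significant effect far more often than the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, reps, alpha = 50, 20, 500, 0.05
false_hits = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))   # predictors are pure noise
    y = rng.standard_normal(n)        # y is independent of all of them
    # "Feature selection": keep the single predictor most correlated
    # with y, then test it naively, ignoring the selection step.
    pvals = [stats.pearsonr(X[:, j], y)[1] for j in range(p)]
    false_hits += min(pvals) < alpha
rate = false_hits / reps
print(rate)   # roughly 1 - 0.95**20 ~ 0.64, not the nominal 0.05
```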

Also, there's a problem with multiple testing: if you don't correct your hypothesis tests (check the Bonferroni correction, for example), you will incorrectly reject hypotheses that are in fact true. https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf
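A quick simulation of the family-wise error rate with and without Bonferroni (all null hypotheses are true; the sample size and number of tests are my choices):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m, reps, alpha = 20, 1000, 0.05
any_raw = any_bonf = 0
for _ in range(reps):
    # m independent one-sample t-tests where every null is true.
    pvals = np.array([stats.ttest_1samp(rng.standard_normal(30), 0.0)[1]
                      for _ in range(m)])
    any_raw += (pvals < alpha).any()        # uncorrected threshold
    any_bonf += (pvals < alpha / m).any()   # Bonferroni threshold
print(any_raw / reps)    # around 1 - 0.95**20 ~ 0.64: frequent false rejections
print(any_bonf / reps)   # at most about alpha = 0.05: error rate controlled
```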

A good way to do feature selection that exploits the lasso is stability selection: https://www.stat.cmu.edu/~ryantibs/journalclub/stability.pdf
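A bare-bones sketch of the core idea, assuming a toy dataset where only the first two predictors matter (the `lasso` helper is a minimal coordinate-descent implementation of my own, and the paper's actual procedure is richer, e.g. it also varies the penalty):

```python
import numpy as np

def lasso(X, y, lam, iters=50):
    """Minimal coordinate-descent lasso: 0.5*||y - Xb||^2 + lam*||b||_1."""
    beta = np.zeros(X.shape[1])
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]   # partial residual
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

rng = np.random.default_rng(4)
n, p = 100, 30
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] + 2 * X[:, 1] + rng.standard_normal(n)  # only 2 real signals

# Stability selection: rerun the lasso on many random half-subsamples
# and keep only the variables that are selected almost every time.
B = 50
counts = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, size=n // 2, replace=False)
    counts += lasso(X[idx], y[idx], lam=15.0) != 0
stable = np.where(counts / B >= 0.8)[0]
print(stable)   # the two true signals, indices 0 and 1
```

Noise variables occasionally enter any single lasso fit, but they are rarely selected across most subsamples, which is exactly the instability the method filters out.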