If you are simply looking for the candidate model that performs best on a given measure (AIC, adjusted $R^2$, whatever), then yes, an exhaustive search makes sense.
However, that is usually not what you want to do. After all, you write:
I select the "best" model or the top selection which I can check out more closely.
If by "check out more closely" you mean "perform inferential statistics on", then no, all subsets is a bad idea, and it does suffer from all the same problems as stepwise variable selection. Specifically, the p values in the final model will be biased low, and you will tend to spuriously believe predictors are significant when they aren't, because you have been fitting noise.
As an illustration, let's simulate a dataset with $n=100$ observations and $k=4$ candidate predictors. The predictors are all uniformly distributed on $[0,1]$, the true parameter values are $(0,2,1,0,0)$ (the initial $0$ is for the intercept, which we will keep in all our models but not consider further), and the residual noise is $N(0,1)$. We perform all-subsets variable selection and retain the model with the largest adjusted $R^2$. (Go ahead and repeat the exercise with AIC; it will look the same.) We then perform inferential statistics on this "optimal" model, recording the parameter estimates and the p values from the t tests on them.
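Here is a minimal sketch of one replication of this procedure in Python, assuming NumPy and statsmodels; the helper name `one_replication`, the seed, and the decision to skip the intercept-only model (whose adjusted $R^2$ is exactly zero here) are my own illustrative choices, not part of the setup above:

```python
import itertools

import numpy as np
import statsmodels.api as sm

n, k = 100, 4
beta = np.array([0.0, 2.0, 1.0, 0.0, 0.0])  # intercept first, as above

def one_replication(rng):
    """Simulate one dataset, run all-subsets selection by adjusted R^2,
    and return the chosen subset with its estimates and p values."""
    X = rng.uniform(0.0, 1.0, size=(n, k))
    y = sm.add_constant(X) @ beta + rng.standard_normal(n)

    best_fit, best_subset = None, None
    # All 2^k - 1 non-empty predictor subsets; the intercept is always kept.
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(k), size):
            fit = sm.OLS(y, sm.add_constant(X[:, list(subset)])).fit()
            if best_fit is None or fit.rsquared_adj > best_fit.rsquared_adj:
                best_fit, best_subset = fit, subset

    return best_subset, best_fit.params, best_fit.pvalues

rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
print(one_replication(rng))
```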
We repeat this exercise 5000 times. Here are the parameter estimates (the red horizontal lines show the true parameter values):

We note conspicuous empty strips around zero in these estimates, in particular for predictors 3 and 4. Yes, their true parameter values are zero, so they should indeed not appear in the model. However, when they do appear in the model, standard theory says their estimates should be (asymptotically) normally distributed around the true value of zero, not "normally distributed with central censoring", i.e., shaped like the two tails of a normal distribution.
What is happening here is that, conditional on a large adjusted $R^2$, the parameter estimates are biased away from zero: they are systematically either too large or too small. An irrelevant predictor makes it into the selected model mainly when its estimated coefficient happens, by chance, to lie far enough from zero to improve the fit, so conditioning on selection empties out the center of the sampling distribution.
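We can observe this censoring directly by repeating the sketch above and looking only at the runs in which predictor 3 is selected (again using the hypothetical `one_replication` helper; the band $|\hat\beta_3|<0.1$ is an arbitrary choice to make the empty strip visible):

```python
rng = np.random.default_rng(0)  # arbitrary seed
est3 = []
for _ in range(5000):           # a few tens of seconds of runtime
    subset, params, _ = one_replication(rng)
    if 2 in subset:                               # column index 2 is predictor 3
        est3.append(params[subset.index(2) + 1])  # +1 skips the intercept

est3 = np.asarray(est3)
print(f"predictor 3 selected in {est3.size} of 5000 runs")
print(f"share of selected estimates with |b3| < 0.1: {np.mean(np.abs(est3) < 0.1):.3f}")
# Unconditionally, the OLS estimate would be roughly N(0, sigma^2 / (n * var(x)));
# conditional on selection, estimates near zero are strongly underrepresented.
```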
Here are histograms of the p values. For those predictors whose true parameter values are zero, the p values should be uniformly distributed on $[0,1]$; the red lines indicate this uniform density.

As you see, the p values of the irrelevant (!) predictors are systematically biased towards smaller values. Again, conditional on a large adjusted $R^2$, that is exactly what we should expect. Thus, if we declare predictors with the standard $p<.05$ to be "significant", we will do so for the irrelevant predictors far too often.
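A quick tally in the same setting makes the inflation concrete: count how often the truly irrelevant predictors clear $p<.05$, conditional on being selected (again assuming the hypothetical `one_replication` helper from above):

```python
rng = np.random.default_rng(1)  # arbitrary seed
hits = appearances = 0
for _ in range(5000):
    subset, _, pvalues = one_replication(rng)
    for j in (2, 3):  # column indices of predictors 3 and 4 (true beta = 0)
        if j in subset:
            appearances += 1
            hits += pvalues[subset.index(j) + 1] < 0.05  # +1 skips the intercept

print(f"false 'significance' rate among selected: {hits / appearances:.3f} (nominal 0.05)")
```

If the selection step were harmless, this rate would sit near the nominal 5%; the conditioning on adjusted $R^2$ pushes it well above that.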