In this hypothetical wrong procedure, it sounds like the number of regressors is chosen to minimize the error over the whole data set. In that case, the model will engorge itself on regressors until it uses all of them, because the training error can never increase as more are added. The reason is that the error is evaluated on the same data used to choose the weights, which allows the model to overfit, i.e. to fit random structure in the training data that isn't representative of the underlying distribution that produced it. The training error is therefore optimistically biased; when run on new data from the same distribution, the model will have greater error, and will regret its former gluttony.
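Here's a minimal sketch of that effect (in Python, assuming numpy and scikit-learn; the simulated data, the choice of 3 "true" regressors, and the sample sizes are all made up for demonstration). Training MSE never increases as nested regressors are added, while test MSE eventually rises once the model starts fitting noise:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, p = 50, 1000, 30

X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))

# Only the first 3 regressors truly matter; the rest are pure noise.
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
y_train = X_train @ beta + rng.normal(size=n_train)
y_test = X_test @ beta + rng.normal(size=n_test)

# Fit nested models with the first k regressors and compare errors.
for k in (1, 3, 10, 30):
    model = LinearRegression().fit(X_train[:, :k], y_train)
    mse_train = mean_squared_error(y_train, model.predict(X_train[:, :k]))
    mse_test = mean_squared_error(y_test, model.predict(X_test[:, :k]))
    print(f"k={k:2d}  train MSE={mse_train:.3f}  test MSE={mse_test:.3f}")
```

If you run this, the training MSE keeps shrinking all the way to k=30, but the test MSE bottoms out around k=3 and then climbs — exactly the gap between in-sample and out-of-sample error described above.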
Regarding answer #1, 'too complex' means that more regressors are chosen than necessary, leading to overfitting. This assumes that the 'proper' model uses only a smaller subset of the regressors.
That said, using stepwise regression is generally not a good idea in the first place (e.g. see here).