5

I've been trying to build a multivariate regression model to predict a value from my training data. I've put my data into an m x n matrix X, where m is the number of instances and n the number of features/predictors. My label vector y_label is then m x 1. This is my code to estimate the theta values (the parameters):

% Ordinary least squares via the normal equations
theta_matrix = pinv(X'*X)*X'*y_label;

Now I want to split the data into training and test sets, and from my research 10-fold cross-validation seems a good option. But if I do that, won't it give me 10 different sets of theta parameters? Which one should I choose?

And about feature selection: I've found that stepwise selection can be a good choice, but I think it doesn't take into account that features can be correlated. Is there an alternative?

Nick Stauner
SamuelNLP
    There's often confusion about the (primary) use of cross-validation to validate a model-selection procedure for a particular data-set, & its (secondary) use as part of a model-selection procedure. In this case you're cross-validating the procedure that gave you one set of parameter estimates, not picking the best of the ten sets - they're disposable. If you were to do the latter, that procedure would itself need to be validated. Dikran's explanation here is good & clear. – Scortchi - Reinstate Monica Feb 25 '14 at 11:53

2 Answers

4

You can use cross-validation to estimate how your model will perform on completely new or 'unseen' data.
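For instance, here is a minimal sketch (assuming the X and y_label variables from the question, and the Statistics Toolbox crossval function) of estimating that out-of-sample error for the plain least-squares fit with 10-fold cross-validation; the ten per-fold fits are only used to compute the error estimate, not kept as competing models:

% Fit on the training fold with least squares, predict the held-out fold
olsfit = @(Xtrain, ytrain, Xtest) Xtest * (Xtrain \ ytrain);
% 10-fold cross-validated mean squared error of that fitting procedure
cv_mse = crossval('mse', X, y_label, 'Predfun', olsfit, 'KFold', 10);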

You should also use cross-validation to select which features to use - try sequential feature selection (sequentialfs in Matlab) or the Lasso (lasso in Matlab). The cvpartition command in Matlab will let you set up your train/test partitions for cross-validation, and sequentialfs will take a partition object as an input.
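For a regression problem that might look roughly like the sketch below (again assuming X and y_label from the question; the critfun wrapper is just one illustrative least-squares criterion, not the only possible choice):

cvp = cvpartition(size(X, 1), 'KFold', 10);   % 10-fold partition of the m instances
% Criterion: fit least squares on the training fold and return the sum of squared
% errors on the test fold (sequentialfs sums this over folds and divides by the
% total number of test observations)
critfun = @(Xtrain, ytrain, Xtest, ytest) sum((ytest - Xtest * (Xtrain \ ytrain)).^2);
inmodel = sequentialfs(critfun, X, y_label, 'cv', cvp);   % logical 1 x n mask of selected features
% Alternatively, a cross-validated lasso:
% [B, FitInfo] = lasso(X, y_label, 'CV', 10);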

Once you have an idea of how your model will behave on unseen data, you can decide which features to use and generate your 'final' model by running the same routine on all available data, which gives you a single set of coefficients.
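As a sketch of that final step (assuming the hypothetical inmodel mask from a feature-selection run over all the data, as above):

theta_final = X(:, inmodel) \ y_label;   % least-squares fit on all data, selected features only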

An excellent answer on a similar topic is here

BGreene
  • I understand how to use sequentialfs in a classification problem, not in a regression problem. – SamuelNLP Feb 24 '14 at 14:54
  • 1
    Matlab provide an example here: http://www.mathworks.co.uk/help/stats/feature-selection.html#brluyid-1 – BGreene Feb 24 '14 at 14:57
  • So I have to give the error to the function sequentialfs, would it be the estimated error from the y_label - y_new_label? – SamuelNLP Feb 24 '14 at 15:05
  • so depending on the partition, the feature selection will differ. – SamuelNLP Feb 24 '14 at 15:06
  • Yes, different features for each partition - hence the derivation of the 'final' model using all data. sequentialfs seeks to minimize an error function, which can be any one the user desires; y - ŷ should work depending on the wrapper function. – BGreene Feb 24 '14 at 15:34
  • my only doubt is then what features set to choose, from those selected in the selection. – SamuelNLP Feb 24 '14 at 15:35
  • 1
    You run the same feature selection routine on all data - this produces your 'final' feature set. See answer here: http://stats.stackexchange.com/questions/2306/feature-selection-for-final-model-when-performing-cross-validation-in-machine?rq=1 – BGreene Feb 24 '14 at 15:39
  • well, in that case all the data is used as training and there is no test set. I'm confused. I've read that in the link you posted. – SamuelNLP Feb 24 '14 at 16:12
  • The test/training split is used to determine the model performance. If you then want to train a model to use on completely new data, you use all the data; but if you want results for how well the model predicts, you need to use cross-validation, bootstrapping or similar. – BGreene Feb 24 '14 at 16:14
1

"And about feature selection, I've found that stepwise can be a good choice, but I think it does not that into account that features can be correlated. Any alternative?"

Stepwise regression is a great way to do feature selection. Make sure you are using a metric that penalizes for incorporating more features, though. For example, stepwise regression on $R^2$ alone may not be useful, but stepwise regression on AIC may be.
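In Matlab that might look something like this sketch (assuming the X and y_label variables from the question; stepwiselm is a Statistics Toolbox function):

% Start from a constant-only model and add/remove linear terms to minimize AIC
mdl = stepwiselm(X, y_label, 'constant', 'Upper', 'linear', 'Criterion', 'aic');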

In general I remove correlated features before doing stepwise regression (although in theory stepwise regression using AIC shouldn't pull in two correlated features). I do this by looking at pairwise correlations, but also by checking for multicollinearity, in particular with variance inflation factors (VIF).
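A quick sketch of both checks (assuming X from the question; the VIF here is the standard diagonal of the inverse correlation matrix):

R = corrcoef(X);        % pairwise correlations between predictors
vif = diag(inv(R))';    % VIF_j = 1/(1 - R_j^2); values well above ~5-10 flag multicollinearity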

Finally, instead of doing any feature selection, you can use principal components analysis or another dimension-reduction technique to reduce your number of inputs. When I've done this in practice, I then use cross-validation to pick how many of the reduced dimensions to use. For example, I build $k$ different models trained on the first $1, 2, 3, \dots, k$ principal components and then see which model performs best under cross-validation.
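A rough sketch of that procedure (assuming X and y_label from the question; pca, zscore and crossval are Statistics Toolbox functions, and you may want to append a column of ones if your model needs an intercept term):

[~, score] = pca(zscore(X));                   % principal-component scores of the standardized predictors
olsfit = @(Xtr, ytr, Xte) Xte * (Xtr \ ytr);   % least-squares fit on one fold, predict on another
cv_mse = zeros(1, size(score, 2));
for j = 1:size(score, 2)                       % models on the first 1, 2, ..., k components
    cv_mse(j) = crossval('mse', score(:, 1:j), y_label, 'Predfun', olsfit, 'KFold', 10);
end
[~, best_j] = min(cv_mse);                     % number of components with the lowest CV error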

Nick Stauner