My situation:
- small sample size: 116
- binary outcome variable
- long list of explanatory variables: 50
- the explanatory variables were not chosen off the top of my head; their selection was based on the literature.
Following a suggestion to a previous question of mine, I have run LASSO (using R's glmnet package) in order to select the subset of explanatory variables that best explains variation in my binary outcome variable.
I have noticed that I get very different values of lambda.min through k-fold cross-validation (the cv.glmnet command) depending on the value I assign to k. I have tried the default (10) and 5. Which would be the most appropriate value of k, considering my sample size?
In my specific case, is it necessary to repeat cross-validation, say 100 times, in order to reduce randomness and allow averaging of the error curves, as is suggested in this post? If so: I have tried the code suggested in that post, but got error messages; could anyone suggest working code?
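A minimal sketch of what that repetition could look like (the simulated x and y below are stand-ins for the real data, and fixing a common lambda sequence across repeats is my assumption, made so the error curves can be averaged point by point):

```r
library(glmnet)

set.seed(1)
# Stand-in data: 116 cases, 50 predictors, binary outcome
n <- 116; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)

# Fit once to obtain a common lambda sequence for every repeat
fit <- glmnet(x, y, family = "binomial")
lambdas <- fit$lambda

# Repeat 10-fold cross-validation 100 times; each column of
# cv_errors is one repeat's cross-validated error curve (cvm)
n_repeats <- 100
cv_errors <- sapply(seq_len(n_repeats), function(i) {
  cv <- cv.glmnet(x, y, family = "binomial", lambda = lambdas, nfolds = 10)
  cv$cvm
})

# Average the error curves across repeats, then pick lambda at the minimum
mean_cvm <- rowMeans(cv_errors)
lambda_min_avg <- lambdas[which.min(mean_cvm)]
```

With small samples like this, lambda.min from a single cross-validation run is noisy, which is exactly why the averaged curve tends to give a more stable choice.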
UPDATE 1: I have managed to use the foldid option in cv.glmnet, as suggested in the comments below, by organizing my x-matrix so that all 32 observations belonging to one of my outcome classes appear in rows 1-32, and by using the following code: foldid = c(sample(rep(seq(10), length = 32)), sample(rep(seq(10), length = 84))). However, when I ran cv.glmnet, only one of the levels of a categorical variable with four levels was included in the model. So, following a suggestion to a previous question of mine, I tried to run the group lasso using R's gglasso package. And now I am facing this issue.
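Put together, the stratified fold assignment described above can be sketched as follows (the simulated x is a stand-in for the real predictor matrix; the class sizes 32 and 84 are from the question):

```r
library(glmnet)

set.seed(1)
# Stand-in data: rows 1-32 are one outcome class, rows 33-116 the other
x <- matrix(rnorm(116 * 50), 116, 50)
y <- c(rep(1, 32), rep(0, 84))

# Stratified fold assignment: each outcome class is spread
# evenly across the 10 folds, then the two pieces are concatenated
# in the same order as the rows of x
foldid <- c(sample(rep(seq(10), length = 32)),
            sample(rep(seq(10), length = 84)))

# foldid fixes fold membership, so nfolds is not needed
cv <- cv.glmnet(x, y, family = "binomial", foldid = foldid)
cv$lambda.min
```

Stratifying the folds this way ensures every fold contains cases from the minority class, which matters with only 32 events.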
Comments:

- Use the foldid argument to let cv.glmnet know which fold each case belongs to. – EdM Sep 30 '14 at 20:18
- […] foldid. – Puzzled Oct 05 '14 at 19:27
- Check your use of glmnet to make sure that all levels of your categorical variables are encoded correctly; see http://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome for how to construct and use the model matrix. For your application, you might want to consider using a sparse model matrix, supported by glmnet. You may have better luck getting help on these coding-specific issues on Stack Overflow or the R-help mailing list. – EdM Oct 06 '14 at 14:44
- […] the model.matrix command. Does it make any difference? I mean, by using model.matrix, would glmnet treat the different levels of a factor as a group? – Puzzled Oct 07 '14 at 12:39
- glmnet does not allow grouping of different levels of a predictor variable. If your coding is correct, then you have found that only 1 of the 4 levels of the factor in question is needed for your LASSO-selected model. As you have learned, the alternative that does allow grouping of levels of a factor, gglasso, runs very slowly in comparison. Consider whether you actually need the grouping for your purpose: if only one level of a categorical variable out of 4 matters for prediction, do you really care about the other 3 levels? – EdM Oct 07 '14 at 14:22
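For the encoding issue raised in the comments, a minimal sketch of building the model matrix might look like this (the data frame and the 4-level factor below are made up for illustration; the outcome is again a stand-in):

```r
library(glmnet)

set.seed(1)
# Hypothetical data frame with a continuous predictor and a 4-level factor
df <- data.frame(
  x1  = rnorm(116),
  grp = factor(sample(c("a", "b", "c", "d"), 116, replace = TRUE))
)
y <- rbinom(116, 1, 0.5)

# model.matrix expands the factor into dummy columns (reference coding);
# drop the intercept column, since glmnet adds its own intercept
x <- model.matrix(~ x1 + grp, data = df)[, -1]

fit <- cv.glmnet(x, y, family = "binomial")
```

Note that this still gives glmnet one column per non-reference level, so the lasso penalizes each dummy separately rather than as a group, consistent with EdM's comment; a sparse alternative is sparse.model.matrix from the Matrix package.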