My situation:
- small sample size: 116
- binary outcome variable
- long list of explanatory variables: 50
- the explanatory variables were not chosen off the top of my head; their selection was based on the literature.
Following a suggestion to a previous question of mine, I have run LASSO (using R's glmnet package) in order to select the subset of explanatory variables that best explains variation in my binary outcome variable.
I have noticed that I get very different values of lambda.min through k-fold cross-validation (the cv.glmnet command) depending on the value I assign to k. I have tried the default (10) and 5. Which would be the most appropriate value of k, considering my sample size?
In my specific case, is it necessary to repeat cross-validation, say 100 times, in order to reduce randomness and allow averaging of the error curves, as is suggested in this post? If so: I have tried the code suggested in that post, but got error messages; could anyone suggest working code?
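A minimal sketch of what that repetition could look like (the simulated x and y below are stand-ins for the real data, and fixing a common lambda sequence across repeats is my assumption, made so the error curves can be averaged point by point):

```r
library(glmnet)

set.seed(1)
# Stand-in data: 116 cases, 50 predictors, binary outcome
n <- 116; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 0.5)

# Fit once to obtain a common lambda sequence for every repeat
fit <- glmnet(x, y, family = "binomial")
lambdas <- fit$lambda

# Repeat 10-fold cross-validation 100 times; each column of
# cv_errors is one repeat's cross-validated error curve (cvm)
n_repeats <- 100
cv_errors <- sapply(seq_len(n_repeats), function(i) {
  cv <- cv.glmnet(x, y, family = "binomial", lambda = lambdas, nfolds = 10)
  cv$cvm
})

# Average the error curves across repeats, then pick lambda at the minimum
mean_cvm <- rowMeans(cv_errors)
lambda_min_avg <- lambdas[which.min(mean_cvm)]
```

With small samples like this, lambda.min from a single cross-validation run is noisy, which is exactly why the averaged curve tends to give a more stable choice.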
UPDATE 1: I have managed to use the foldid option in cv.glmnet, as suggested in the comments below, by organizing my x-matrix so that all 32 observations belonging to one of my outcome classes appear in rows 1-32, and by using the following code: foldid = c(sample(rep(seq(10), length = 32)), sample(rep(seq(10), length = 84))). However, when I ran cv.glmnet, only one of the levels of a categorical variable with four levels was included in the model. So, following a suggestion to a previous question of mine, I tried to run the group lasso using R's gglasso package. And now I am facing this issue.
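Put together, the stratified fold assignment described above can be sketched as follows (the simulated x is a stand-in for the real predictor matrix; the class sizes 32 and 84 are from the question):

```r
library(glmnet)

set.seed(1)
# Stand-in data: rows 1-32 are one outcome class, rows 33-116 the other
x <- matrix(rnorm(116 * 50), 116, 50)
y <- c(rep(1, 32), rep(0, 84))

# Stratified fold assignment: each outcome class is spread
# evenly across the 10 folds, then the two pieces are concatenated
# in the same order as the rows of x
foldid <- c(sample(rep(seq(10), length = 32)),
            sample(rep(seq(10), length = 84)))

# foldid fixes fold membership, so nfolds is not needed
cv <- cv.glmnet(x, y, family = "binomial", foldid = foldid)
cv$lambda.min
```

Stratifying the folds this way ensures every fold contains cases from the minority class, which matters with only 32 events.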
Comments:

- Use the foldid argument to let cv.glmnet know which fold each case belongs to. – EdM Sep 30 '14 at 20:18
- […] foldid. – Puzzled Oct 05 '14 at 19:27
- Check your use of glmnet to make sure that all levels of your categorical variables are encoded correctly; see http://stats.stackexchange.com/questions/72251/an-example-lasso-regression-using-glmnet-for-binary-outcome for how to construct and use the model matrix. For your application, you might want to consider using a sparse model matrix, supported by glmnet. You may have better luck getting help on these coding-specific issues on Stack Overflow or the R-help mailing list. – EdM Oct 06 '14 at 14:44
- […] the model.matrix command. Does it make any difference? I mean, by using model.matrix, would glmnet treat the different levels of a factor as a group? – Puzzled Oct 07 '14 at 12:39
- glmnet does not allow grouping of different levels of a predictor variable. If your coding is correct, then you have found that only 1 of the 4 levels of the factor in question is needed for your LASSO-selected model. As you have learned, the alternative that does allow grouping of levels of a factor, gglasso, runs very slowly in comparison. Consider whether you actually need the grouping for your purpose: if only one level of a categorical variable out of 4 matters for prediction, do you really care about the other 3 levels? – EdM Oct 07 '14 at 14:22
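For the encoding issue raised in the comments, a minimal sketch of building the model matrix might look like this (the data frame and the 4-level factor below are made up for illustration; the outcome is again a stand-in):

```r
library(glmnet)

set.seed(1)
# Hypothetical data frame with a continuous predictor and a 4-level factor
df <- data.frame(
  x1  = rnorm(116),
  grp = factor(sample(c("a", "b", "c", "d"), 116, replace = TRUE))
)
y <- rbinom(116, 1, 0.5)

# model.matrix expands the factor into dummy columns (reference coding);
# drop the intercept column, since glmnet adds its own intercept
x <- model.matrix(~ x1 + grp, data = df)[, -1]

fit <- cv.glmnet(x, y, family = "binomial")
```

Note that this still gives glmnet one column per non-reference level, so the lasso penalizes each dummy separately rather than as a group, consistent with EdM's comment; a sparse alternative is sparse.model.matrix from the Matrix package.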