
This question is a follow-up to this question I asked here last week. I got an important and useful answer to it, but what that led to was surprising. And once again, here is a link to the GitHub Repo for this project.

I am running N LASSO Regressions, one on each of the N datasets I have in a file folder, to serve as one of several Benchmark Variable Selection Algorithms whose performance can be compared with that of a novel Variable Selection Algorithm being proposed by the principal author of the paper we are collaborating on. I originally ran my N LASSOs via the enet function, then tried to replicate my findings via glmnet and lars, and each function selected different variables. The way I had previously set up my code to run them with glmnet fundamentally misused the s argument of the coef() call in the middle of the following code block:

library(glmnet)   # for glmnet() / cv.glmnet()
library(dplyr)    # for select(), starts_with(), filter()

set.seed(11)
glmnet_lasso.fits <- lapply(datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 1))

# store and print out all of the regression equation
# specifications selected by LASSO when called
lasso.coeffs = glmnet_lasso.fits |> Map(f = \(model) coef(model, s = .1))

Variables.Selected <- lasso.coeffs |> 
  Map(f = \(matr) matr |> as.matrix() |> as.data.frame() |> 
        filter(s1 != 0) |> rownames())

# drop the intercept from each set of selected variable names
Variables.Selected = lapply(seq_along(datasets), \(j) Variables.Selected[[j]][-1])

Variables.Not.Selected <- lasso.coeffs |> 
  Map(f = \(matr) matr |> as.matrix() |> as.data.frame() |> 
        filter(s1 == 0) |> rownames())

write.csv(data.frame(
  DS_name = DS_names_list, 
  Variables.Selected.by.glmnet = sapply(Variables.Selected, toString), 
  Structural_Variables = sapply(Structural_Variables, toString), 
  Variables.Not.Selected.by.glmnet = sapply(Variables.Not.Selected, toString), 
  Nonstructural_Variables = sapply(Nonstructural_Variables, toString)), 
  file = "LASSO's Selections via glmnet for the DSs from '0.75-11-1-1 to 0.75-11-10-500'.csv", 
  row.names = FALSE)

This was a misuse because, for glmnet, the s argument of coef() is the lambda penalty itself, which is not how s works for enet or lars. That meant I was arbitrarily fixing the penalty at lambda = 0.1 for every LASSO rather than choosing it by cross-validation (a minimal sketch contrasting the two interfaces follows the corrected code below). So, I realized my error and wrote this adjusted code, which works:

grid <- 10^seq(10, -2, length = 100)
set.seed(11)     # to ensure replicability
glmnet_lasso.fits <- lapply(X = datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 1, lambda = grid))

# use cross-validation to select the penalty parameter for each LASSO
system.time(
  cv_glmnet_lasso.fits <- lapply(datasets, function(i) 
    cv.glmnet(x = as.matrix(select(i, starts_with("X"))), 
              y = i$Y, alpha = 1)))

# store and print out the regression equation
# specifications selected by LASSO when called
lasso.coeffs = cv_glmnet_lasso.fits |> Map(f = \(model) coef(model, s = "lambda.min"))

Variables.Selected <- lasso.coeffs |> 
  Map(f = \(matr) matr |> as.matrix() |> as.data.frame() |> 
        filter(s1 != 0) |> rownames())

# drop the intercept from each set of selected variable names
Variables.Selected = lapply(seq_along(datasets), \(j) Variables.Selected[[j]][-1])
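
As promised above, here is a minimal sketch of the s mix-up on a toy design matrix, assuming the elasticnet package's enet()/predict.enet() interface; the toy X, y, and the 0.1 value are purely illustrative:

library(glmnet)       # in coef() for a glmnet fit, s is the penalty lambda itself
library(elasticnet)   # for an enet fit, s is interpreted according to mode =

# toy data, purely illustrative
set.seed(11)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- drop(X[, 1:2] %*% c(3, -2)) + rnorm(100)

glmnet_fit <- glmnet(X, y, alpha = 1)
coef(glmnet_fit, s = 0.1)                 # here s = lambda = 0.1, an absolute penalty value

enet_fit <- enet(X, y, lambda = 0)        # lambda = 0 gives the plain LASSO path
predict(enet_fit, s = 0.1, type = "coefficients",
        mode = "fraction")$coefficients   # here s = 0.1 means 10% of the maximal L1 norm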

But when I started re-running these glmnet LASSOs on all 260,000 of my datasets, 5k or 10k at a time, I noticed their performance is noticeably poorer! For example, running both versions on just the first 10 of my 260k datasets (located in the folder called "ten" in the Repository): the original version, with an arbitrary lambda of 0.1 for each of the 10 LASSOs, gave these performance metrics:

> performance_metrics
  True.Positive.Rate  True.Negative.Rate  False.Positive.Rate  Underspecified.Models.Selected
                   1                0.97                 0.03                               0
  Correctly.Specified.Models.Selected  Overspecified.Models.Selected
                                    4                             6
  All.Correct..Over..and.Underspecified.Models  Models.with.at.least.one.Omitted.Variable
                                            10                                          0
  Models.with.at.least.one.Extra.Variable
                                        6

But when I redo it the proper way, using cross-validation, these are my performance metrics:

> performance_metrics
  True.Positive.Rate  True.Negative.Rate  False.Positive.Rate  Underspecified.Models.Selected
                   1               0.819                0.181                               0
  Correctly.Specified.Models.Selected Overspecified.Models.Selected
                                    1                             9
  All.Correct..Over..and.Underspecified.Models  Models.with.at.least.one.Omitted.Variable
                                            10                                          0
  Models.with.at.least.one.Extra.Variable
                                        9
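
For reference, here is a hedged sketch of how tallies like these might be computed, assuming (as suggested by the write.csv() call above) that Variables.Selected and Structural_Variables are same-length lists of character vectors, and assuming the 30 candidate columns are named X1, ..., X30; the exact definitions used in the repo may differ:

candidates <- paste0("X", 1:30)   # assumed names of the 30 candidate columns

# per-dataset confusion counts for variable selection
counts <- do.call(rbind, Map(function(sel, true) {
  nonstructural <- setdiff(candidates, true)
  c(TP = length(intersect(sel, true)),
    FP = length(intersect(sel, nonstructural)),
    FN = length(setdiff(true, sel)),
    TN = length(setdiff(nonstructural, sel)))
}, Variables.Selected, Structural_Variables))

# one plausible set of definitions, pooled across datasets
performance_metrics <- data.frame(
  True.Positive.Rate  = sum(counts[, "TP"]) / sum(counts[, "TP"] + counts[, "FN"]),
  True.Negative.Rate  = sum(counts[, "TN"]) / sum(counts[, "TN"] + counts[, "FP"]),
  False.Positive.Rate = sum(counts[, "FP"]) / sum(counts[, "FP"] + counts[, "TN"]),
  Underspecified.Models.Selected      = sum(counts[, "FN"] > 0),
  Correctly.Specified.Models.Selected = sum(counts[, "FN"] == 0 & counts[, "FP"] == 0),
  Overspecified.Models.Selected       = sum(counts[, "FN"] == 0 & counts[, "FP"] > 0))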

Have I made another mistake in my code, just a different one this time somehow?

p.s. A quote from the README:

Each dataset name is of the general form n1-N2-N3-N4.csv, where:
n1 is one of the following 4 levels of multicollinearity among the set of all true regressors (structural variables/factors) for that dataset: 0, 0.25, 0.5, 0.75;
N2 is the number of structural variables in that dataset, which can be any one of 13 possible values: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 (this is known with certainty for each synthetic dataset by construction, which is the point of creating them via Monte Carlo methods: to explore the properties of new algorithms or estimators and/or compare their performance against standard Benchmarks, as I am doing here for this project);
N3 is the Error Variance, which is one of the 10 following monotonically increasing integer values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; and lastly,
N4, which Dr. Davies decided to set to 500 for this project, is the number of different synthetic datasets to generate randomly with all three of the aforementioned characteristics held fixed. Just to make that last point a little clearer, 4 * 13 * 10 * 500 = 260,000 datasets.
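
A small sketch of how those four design factors can be recovered from a file name of this form; the helper name and the example file name are just illustrative:

# split "n1-N2-N3-N4.csv" back into its four design factors
parse_ds_name <- function(fname) {
  parts <- strsplit(sub("\\.csv$", "", basename(fname)), "-")[[1]]
  data.frame(multicollinearity = as.numeric(parts[1]),
             n_structural_vars = as.integer(parts[2]),
             error_variance    = as.integer(parts[3]),
             replicate_id      = as.integer(parts[4]))
}

parse_ds_name("0.75-11-1-1.csv")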

  • I would worry about using your own single grid for the lambda values to evaluate over all data sets. The glmnet() function can do a pretty good job of finding a grid and adapting the grid to the data set at hand. Also, don't overestimate the ability of LASSO to find the "true" model; it often can't. If lambda.min is returning too many non-zero coefficients for this type of data, consider using lambda.1se from cv.glmnet() in parallel for comparison. – EdM May 30 '23 at 15:37
  • @EdM in retrospect, I think you are right. What I did was run my code without any grid explicitly specified, then with the generically sized explicit grid you see here for the first subset of 5,000 datasets, and got identical performance metrics and variables selected, so I just kept it in, thinking it would make the code slightly more readable for journal referees and whatnot. I will go ahead and try lambda.1se and compare the results, thanks again! – Marlen May 30 '23 at 19:54
  • @EdM I just uploaded 3 things to the Stage 2 Results folder in the Repo, two csvs and an xlsx file; the latter is called "glmnet's selections with vs without grid explicitly included for the DSs from '0.75-15-1-1 to 0.75-15-10-500'" and its name is self-explanatory. But I am sure you trust me enough to write a comparative IF function in Excel. I compared the variables selected for a different set of 5k datasets both with and without the grid argument included and they were identical for all 5k cases. – Marlen May 31 '23 at 14:17
  • @EdM using lambda.1se instead makes an ENORMOUS difference, wow! And to the point of the question: yes, its performance scores on the 1st subset of 10k datasets are slightly superior to the arbitrary version I ran originally, and the number of correctly specified models it selects overall is much higher as well. – Marlen May 31 '23 at 19:18
  • Don't forget that the superiority of lambda.1se is most likely to be seen when the true model has few non-zero coefficients, as seems to be the case here by construction, and when it's important to try to identify the "true" model (which you often can't do with messy real-world data, anyway). For predictive models you might be better off with lambda.min in LASSO even if that includes some "false positive" coefficients, or with elastic net or even ridge regression. – EdM May 31 '23 at 20:33
  • @EdM yes of course, this is why I included a long paragraph from the README as a post-script on this question explaining that there are only ever a minimum of 3 up to a maximum of 15 structural variables (true factors) in each dataset, out of a total of 30 candidate variable columns. For me, the problem is determining how many structural variables counts as "only a few". It isn't obvious to me how to figure that out.

    Also, lambda.1se did WAY better than lambda.min. But it still performed a bit worse than my original code with the lambda hard-coded as 0.1. Strange.

    – Marlen Jun 02 '23 at 17:06
  • @EdM I just put up another follow up about the fact that the performance is still worse than the original version which was arbitrary if you care to take a crack at it or let me know how I can improve the question: https://stats.stackexchange.com/questions/617875/interpreting-variable-selection-performance-on-n-datasets-using-n-glmnet-lassos – Marlen Jun 06 '23 at 00:23
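
Following up on the lambda.1se suggestion in the comments above, the swap in the question's own pipeline is a one-line change (a sketch, assuming the cv_glmnet_lasso.fits list created in the corrected code):

# extract coefficients at the largest penalty within one SE of the minimum CV error
lasso.coeffs.1se <- cv_glmnet_lasso.fits |>
  Map(f = \(model) coef(model, s = "lambda.1se"))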

1 Answer


There is some statistical content underlying this question, having to do with what choice to make for the penalty factor when fitting a LASSO model. If you know that the "correct" model is sparse, lambda.min returned by cv.glmnet() might not be the best choice.

To illustrate the points in my comment, examine what happens with a single data set. First, let cv.glmnet() work on its own.

lDat <- read.csv("~/Downloads/0-3-1-1.csv",skip=2)
library(glmnet)
set.seed(11)
cv0311 <- cv.glmnet(x=as.matrix(lDat[2:31]),y=lDat$Y)
cv0311
#
# Call:  cv.glmnet(x = as.matrix(lDat[2:31]), y = lDat$Y) 
#
# Measure: Mean-Squared Error 
#
#     Lambda Index Measure      SE Nonzero
# min 0.03104    39  0.9792 0.06091      16
# 1se 0.13752    23  1.0316 0.05846       3

In this case, if you use minimum mean-squared error as the criterion, you get lambda.min = 0.03104 and retain 16 of the 30 coefficients. There is no absolute necessity to make that choice, however. The same single invocation of the function also provides the highest penalty within one standard error of the minimum, lambda.1se = 0.13752, with only 3 non-zero coefficients. If you know that the "correct" model is sparse (as these data seem to be constructed), then lambda.1se can be more appropriate.
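
For completeness, both sets of coefficients can be pulled straight from the cv0311 fit above (output omitted here):

coef(cv0311, s = "lambda.min")   # 16 non-zero coefficients, intercept aside
coef(cv0311, s = "lambda.1se")   # only 3 non-zero coefficients, intercept aside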

I suspect that the arbitrary penalty values in your prior invocations of LASSO-related functions came closer to the lambda.1se values and thus retained fewer coefficients than what you have found with lambda.min.

The next point is more specifically about implementation. If you pre-specify the grid of lambda values to evaluate instead of letting the function choose, you might be getting into trouble.

grid <- 10^seq(10, -2, length = 100)
set.seed(11)
cv0311FixedGrid <- cv.glmnet(x=as.matrix(lDat[2:31]),y=lDat$Y,lambda=grid)
cv0311FixedGrid
#
# Call:  cv.glmnet(x = as.matrix(lDat[2:31]), y = lDat$Y, lambda = grid) 
#
# Measure: Mean-Squared Error 
#
#      Lambda Index Measure      SE Nonzero
# min 0.03054    96  0.9792 0.06095      17
# 1se 0.12328    91  1.0227 0.05798       3

You retain about the same number of coefficients at each of the two final lambda choices as before, but note that the Index at which those choices fall is near the very end of your 100-point grid, at its smallest lambda values, unlike what happens when you just let cv.glmnet() specify its own data-driven grid. I would worry that your best penalty choices will end up "off the grid" when the same fixed grid is applied to other data sets.
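
One quick way to check that concern on any given data set (a sketch using the objects defined above) is to compare the selected penalties against the edges of the supplied grid and against the grid glmnet builds for itself:

range(cv0311$lambda)                       # the data-driven lambda sequence glmnet chose on its own
range(grid)                                # the fixed grid: 1e-02 up to 1e+10
sum(grid >= cv0311FixedGrid$lambda.min)    # 96 of the 100 grid points sit at or above lambda.min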

EdM
  • The problem with grids being too coarse or not covering a wide enough range is also very common in applications of kernel methods such as the SVM. I use the Nelder-Mead simplex instead; I wonder why gradient-free optimisation methods other than grid search are not more widely used. – Dikran Marsupial Jun 05 '23 at 07:51
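
To make that comment concrete, here is a hedged sketch on toy data of tuning (alpha, log-lambda) for an elastic net with Nelder-Mead on a fixed-fold cross-validated MSE instead of a grid; every name and setting here is illustrative rather than taken from the question's repository:

library(glmnet)

# toy data: 3 true predictors out of 30 candidates
set.seed(11)
n <- 200; p <- 30
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1)) + rnorm(n)

# fixed folds so the objective is deterministic for the simplex search
K <- 5
folds <- sample(rep(seq_len(K), length.out = n))

cv_mse <- function(par) {
  alpha  <- plogis(par[1])   # squash onto (0, 1)
  lambda <- exp(par[2])      # keep the penalty positive
  errs <- sapply(seq_len(K), function(k) {
    fit  <- glmnet(X[folds != k, ], y[folds != k], alpha = alpha, lambda = lambda)
    pred <- predict(fit, newx = X[folds == k, ], s = lambda)
    mean((y[folds == k] - pred)^2)
  })
  mean(errs)
}

opt <- optim(c(0, log(0.1)), cv_mse, method = "Nelder-Mead")
c(alpha = plogis(opt$par[1]), lambda = exp(opt$par[2]), cv.mse = opt$value)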