This question is a follow-up to a question I asked here last week. I got an important and useful answer to it, but where that answer led was surprising. Once again, here is a link to the GitHub Repo for this project.
I am running N LASSO Regressions, one for each of the N datasets in a file folder, to serve as one of several Benchmark Variable Selection Algorithms whose performance can be compared with that of a novel Variable Selection Algorithm being proposed by the principal author of the paper we are collaborating on. I originally ran my N LASSOs via the enet function, but when I tried to replicate my findings via glmnet and lars, each function selected different variables. The way I had previously set up my code to run them using the glmnet function fundamentally misused the s argument of the coef() call in the middle of the following code block:
library(dplyr)   # for select(), starts_with(), and filter()
library(glmnet)  # for glmnet() and cv.glmnet()

set.seed(11)
glmnet_lasso.fits <- lapply(datasets, function(i)
glmnet(x = as.matrix(select(i, starts_with("X"))),
y = i$Y, alpha = 1))
# Store and print out all of the regression equation
# specifications selected by LASSO when called
lasso.coeffs <- glmnet_lasso.fits |>
  Map(f = \(model) coef(model, s = .1))
Variables.Selected <- lasso.coeffs |>
  Map(f = \(matr) matr |> as.matrix() |>
        as.data.frame() |> filter(s1 != 0) |> rownames())
# drop the intercept from each dataset's selection
Variables.Selected <- lapply(Variables.Selected, \(v) v[-1])
Variables.Not.Selected <- lasso.coeffs |>
  Map(f = \(matr) matr |> as.matrix() |>
        as.data.frame() |> filter(s1 == 0) |> rownames())
write.csv(data.frame(DS_name = DS_names_list,
Variables.Selected.by.glmnet = sapply(Variables.Selected, toString),
Structural_Variables = sapply(Structural_Variables, toString),
Variables.Not.Selected.by.glmnet = sapply(Variables.Not.Selected, toString),
Nonstructural_Variables = sapply(Nonstructural_Variables, toString)),
file = "LASSO's Selections via glmnet for the DSs from '0.75-11-1-1 to 0.75-11-10-500'.csv",
row.names = FALSE)
The problem is that for glmnet, s is the lambda penalty term in each LASSO itself, and this is not the case for enet or lars. That meant I was arbitrarily fixing the penalty for every LASSO at 0.1 rather than selecting it via cross-validation.
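To make the two meanings of s concrete, here is a minimal sketch on toy data (the x and y below are purely illustrative, not one of my datasets):

library(glmnet)
library(lars)

set.seed(11)
x <- matrix(rnorm(100 * 5), 100, 5)
y <- x[, 1] + 2 * x[, 2] + rnorm(100)

# glmnet: the 's' passed to coef() IS the lambda penalty value itself
glmnet_fit <- glmnet(x, y, alpha = 1)
coef(glmnet_fit, s = 0.1)  # coefficients at lambda = 0.1

# lars: 's' is interpreted according to 'mode'; with mode = "fraction" it is
# a fraction of the maximal L1 norm, not a lambda value
lars_fit <- lars(x, y, type = "lasso")
coef(lars_fit, s = 0.1, mode = "fraction")  # 10% of the way along the L1 path

So, I realized my error, and wrote this adjusted code which works: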
grid <- 10^seq(10, -2, length = 100)
set.seed(11) # to ensure replicability
glmnet_lasso.fits <- lapply(X = datasets, function(i)
glmnet(x = as.matrix(select(i, starts_with("X"))),
y = i$Y, alpha = 1, lambda = grid))
# Use cross-validation to select the penalty parameter
system.time(cv_glmnet_lasso.fits <- lapply(datasets, function(i)
cv.glmnet(x = as.matrix(select(i, starts_with("X"))), y = i$Y, alpha = 1)))
# Store and print out the regression equation specifications selected by LASSO when called
lasso.coeffs <- cv_glmnet_lasso.fits |>
  Map(f = \(model) coef(model, s = "lambda.min"))
Variables.Selected <- lasso.coeffs |>
  Map(f = \(matr) matr |> as.matrix() |>
        as.data.frame() |> filter(s1 != 0) |> rownames())
# drop the intercept from each dataset's selection
Variables.Selected <- lapply(Variables.Selected, \(v) v[-1])
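As a quick sanity check on the corrected pipeline, the cross-validated penalties can be pulled out and compared against the 0.1 I had been hard-coding (a purely diagnostic snippet, not part of the benchmark itself):

# lambdas chosen by cross-validation for each dataset
sapply(cv_glmnet_lasso.fits, function(fit) fit$lambda.min)
sapply(cv_glmnet_lasso.fits, function(fit) fit$lambda.1se)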
But when I started re-running these glmnet LASSOs on all 260,000 of my datasets, 5 or 10 thousand at a time, I noticed their performance is noticeably poorer! For example, running both versions on just the first 10 of my 260k datasets (located in the folder called "ten" in the Repository), the original version, with an arbitrary lambda of 0.1 for each of the 10 LASSOs, produced these performance metrics:
> performance_metrics
True.Positive.Rate True.Negative.Rate False.Positive.Rate Underspecified.Models.Selected
1 0.97 0.03 0
Correctly.Specified.Models.Selected Overspecified.Models.Selected
4 6
All.Correct..Over..and.Underspecified.Models Models.with.at.least.one.Omitted.Variable
10 0
Models.with.at.least.one.Extra.Variable
6
But when I redo it the proper way, using cross-validation, these are my performance metrics:
> performance_metrics
True.Positive.Rate True.Negative.Rate False.Positive.Rate Underspecified.Models.Selected
1 0.819 0.181 0
Correctly.Specified.Models.Selected Overspecified.Models.Selected
1 9
All.Correct..Over..and.Underspecified.Models Models.with.at.least.one.Omitted.Variable
10 0
Models.with.at.least.one.Extra.Variable
9
Have I made another mistake in my code, but a different one this time somehow?
P.S. A quote from the README:
Each dataset name is of the general form n1-N2-N3-N4.csv, where:

n1 is one of the following 4 levels of multicollinearity among the set of all true regressors (structural variables/factors) for that dataset: 0, 0.25, 0.5, 0.75;

N2 is the number of structural variables in that dataset, which can be any one of 13 possible values: 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15. (This is known with certainty for each synthetic dataset by construction, which is the whole reason to create them via Monte Carlo methods: so they can be used to explore the properties of new algorithms or estimators and/or to compare their performance against standard Benchmarks, as I am doing here for this project.);

N3 is the Error Variance, which is one of the following 10 monotonically increasing integer values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; and lastly,

N4, which Dr. Davies decided to set to 500 for this project, is the number of different synthetic datasets to generate randomly with all three of the aforementioned characteristics fixed. Just to make that last point a little clearer: 4 * 13 * 10 * 500 = 260,000 datasets.
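For anyone reproducing this, here is a small hypothetical helper (parse_ds_name is my own name for it, not something from the Repo) that unpacks this naming scheme:

parse_ds_name <- function(fname) {
  # strip the .csv extension and split on the dashes
  parts <- strsplit(sub("\\.csv$", "", basename(fname)), "-")[[1]]
  data.frame(multicollinearity = as.numeric(parts[1]),
             n_structural      = as.integer(parts[2]),
             error_variance    = as.integer(parts[3]),
             replicate         = as.integer(parts[4]))
}

parse_ds_name("0.75-11-1-1.csv")
#   multicollinearity n_structural error_variance replicate
# 1              0.75           11              1         1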
Comments:

… lambda values to evaluate over all data sets. The glmnet() function can do a pretty good job of finding a grid and adapting the grid to the data set at hand. Also, don't overestimate the ability of LASSO to find the "true" model; it often can't. If lambda.min is returning too many non-zero coefficients for this type of data, consider using lambda.1se from cv.glmnet() in parallel for comparison. – EdM May 30 '23 at 15:37

… lambda.1se is most likely to be seen when the true model has few non-zero coefficients, as seems to be the case here by construction, and when it's important to try to identify the "true" model (which you often can't do with messy real-world data, anyway). For predictive models you might be better off with lambda.min in LASSO even if that includes some "false positive" coefficients, or with elastic net or even ridge regression. – EdM May 31 '23 at 20:33

Also, lambda.1se did WAY better than lambda.min. But it performed a bit worse than my original code with the lambda hard-coded as 0.1. Strange. – Marlen Jun 02 '23 at 17:06
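Following up on EdM's suggestion, the lambda.1se comparison mentioned in my last comment only required changing the s argument in the extraction step (a sketch mirroring my lambda.min code above):

lasso.coeffs.1se <- cv_glmnet_lasso.fits |>
  Map(f = \(model) coef(model, s = "lambda.1se"))
Variables.Selected.1se <- lasso.coeffs.1se |>
  Map(f = \(matr) matr |> as.matrix() |>
        as.data.frame() |> filter(s1 != 0) |> rownames())
# drop the intercept from each dataset's selection
Variables.Selected.1se <- lapply(Variables.Selected.1se, \(v) v[-1])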