
The GitHub Repository for this research project contains all of the code included in this question. Brief background context: I am just finishing up my work as a coauthor on a research project exploring the properties of a newly proposed Automated Optimal Variable Selection algorithm. My role is to decide which existing automated feature selection techniques to use as the benchmarks, then run them and evaluate their performance using standard classification performance metrics in R. I ended up choosing 3 benchmarks: Backward Elimination Stepwise Regression, Forward Selection Stepwise Regression, and LASSO Regression.

I have already done this, but in the process I ran into something most unexpected which I have not been able to find any journal article, piece of R documentation, or textbook section to explain. My collaborator randomly generated 260,000 synthetic datasets for me to run the benchmarks on, so I ran 260k LASSO Regressions (initially using the enet() function from R's elasticnet package), one for each dataset, got my results, and calculated my performance metrics. Then I did it all again using the glmnet() function from the glmnet package and found that the set of variables it selected in each dataset was not always identical to the set of variables selected by enet on the same dataset, even when using the same random seed value for both.

So, at that point, being exasperated, I threw my hands up and re-did it all a third time using the lars() function from the package of the same name, and once again the variables selected were slightly different. In the Stage 2 Results folder of the aforementioned GitHub Repository, the fact that each of these selected different variables can be verified by inspecting LASSO's Selections via glmnet.xlsx, LASSO's Selections via lars.xlsx, Variables Selected by LASSO ran via enet.xlsx, or Overall LASSO Performance Metrics.xlsx.

To take an example at random (per a helpful suggestion in the comments below), for dataset 0.25-3-1-1 these are their respective sets of selected factors:

  • enet: X5, X21, X22
  • lars: X21, X22
  • glmnet: X5, X21, X22, X30

And for completeness with respect to this example, the correct set of regressors, i.e. the structural variables, for the 0.25-3-1-1 dataset is: X5, X21, X22

This means that only enet got exactly the right answer (a True Positive Rate of 1 and a True Negative Rate of 1).
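For concreteness, here is a minimal sketch (illustrative only, not the code from the Repo) of how such a True Positive Rate and True Negative Rate can be computed for one dataset. It assumes 30 candidate factors X1 through X30, which is a made-up number for this example, and all object names are hypothetical.

# Illustrative sketch only: TPR/TNR for one dataset, assuming 30 candidate factors.
candidates <- paste0("X", 1:30)
true_vars  <- c("X5", "X21", "X22")          # structural variables, known by construction
selected   <- c("X5", "X21", "X22", "X30")   # e.g. the glmnet selection quoted above

TPR <- length(intersect(selected, true_vars)) / length(true_vars)
TNR <- length(setdiff(setdiff(candidates, true_vars), selected)) /
       length(setdiff(candidates, true_vars))
c(TPR = TPR, TNR = TNR)   # enet's selection above would give TPR = 1 and TNR = 1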

Here is the code I used to run them via the enet function:

set.seed(11)
enet_LASSO_fits <- lapply(datasets, function(i) 
  elasticnet::enet(x = as.matrix(dplyr::select(i, starts_with("X"))), 
                   y = i$Y, lambda = 0, normalize = FALSE))

This separates out and stores just the coefficient estimates from each LASSO.

LASSO_Coeffs <- lapply(enet_LASSO_fits, function(i) 
  predict(i, s = 0.1, mode = "fraction", 
          type = "coefficients")[["coefficients"]])

Write my own custom lapply which will separate out and return a new list containing just the Independent Variables which are 'selected' or chosen for each individual dataset.

IVs_Selected <- lapply(LASSO_Coeffs, function(i) names(i[i > 0]))

Here is the code for running them via the glmnet function:

set.seed(11)     
glmnet_lasso.fits <- lapply(datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 1))

This stores and prints out all of the regression equation specifications selected by LASSO when called.

lasso.coeffs <- glmnet_lasso.fits |> Map(f = \(model) coef(model, s = .1))

Variables.Selected <- lasso.coeffs |> Map(f = \(matr) matr |> as.matrix() |> 
  as.data.frame() |> filter(s1 != 0) |> rownames())
Variables.Selected <- lapply(seq_along(datasets), \(j) Variables.Selected[[j]][-1])

Here is the code for running them via the lars function:

set.seed(11)     
lars_LASSO_fits <- lapply(datasets, function(i) 
  lars(x = as.matrix(select(i, starts_with("X"))), 
         y = i$Y, type = "lasso"))

This stores and prints out all of the regression equation specifications selected by LASSO when called.

Lars.Coeffs <- lapply(lars_LASSO_fits, function(i) 
  predict(i, s = 0.1, mode = "fraction", 
          type = "coefficients")[["coefficients"]])

IVs.Selected.by.Lars <- lapply(Lars.Coeffs, function(i) names(i[i > 0]))

What is going on here? Am I doing something wrong, or does each of these fitting functions use different underlying stopping conditions or starting conditions or something like that?

p.s. 1 - The script I used to run my 260k LASSOs for the first time (by way of enet()) is the one in the GitHub Repo called "LASSO Regressions.R", the script in which I estimated them using the glmnet function is fittingly called "LASSO using the 'glmnet' package.R", and the one in which I used lars to fit them is called "LASSO using Lars.R".

p.s. 2 - By the way, I re-ran all of my 260k Backward and Forward Stepwise Regressions using the stepAIC() function from R's MASS package instead of what I used the first time around, namely the step() function from the stats library, and all of the variables it selected were identical in every case; as a result, I had no doubts that the same would be true for LASSO.
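For reference, that comparison amounts to something like the following hedged sketch for a single dataset (the data frame name df and the backward direction shown are placeholders rather than the exact calls from my scripts):

# Illustrative sketch only: compare step() and stepAIC() selections on one
# dataset df with response Y and candidate factors X* (df is a placeholder).
full_model <- lm(Y ~ ., data = df)

backward_step    <- stats::step(full_model, direction = "backward", trace = 0)
backward_stepAIC <- MASS::stepAIC(full_model, direction = "backward", trace = 0)

# Both default to the same AIC criterion, so the retained regressors should match.
identical(sort(names(coef(backward_step))), sort(names(coef(backward_stepAIC))))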

p.s. 3 - As was suggested in a comment thread with the person who answered this question, I asked a follow-up question to this one afterwards.

Marlen
  • Do these different packages actually need a seed? (If they do then there is no surprise: different codings are likely to use random numbers in different ways.) What happens to the first with different seeds? Or the second or the third? – Henry May 23 '23 at 07:49
  • It may make it just a bit easier if you chose one dataset instead of 260k to illustrate the point. – Richard Hardy May 23 '23 at 08:03
  • This question illustrates the types of subtle differences in implementation that can lead to such differences among results of seemingly similar functions. Also, read the discussion of "Uniqueness of the Lasso solution" on pages 19-20 of Statistical Learning with Sparsity and see if issues of non-unique solutions might be at play in your combinations of data and predictors. – EdM May 23 '23 at 16:12
  • Another warning: you seem to be using Lasso at one specific value of the penalty factor in the code coef(model, s = .1). That's not how Lasso is typically used for variable selection. More usually, one would use something like cv.glmnet() to select an optimal penalty factor based on minimum cross-validated error or the 1-standard-error criterion. – EdM May 23 '23 at 16:26
  • @EdM thank you! I have been mostly going off of An Introduction to Statistical Learning, Applied Predictive Analytics, and Elements of Statistical Learning, but mostly the first 2! – Marlen May 23 '23 at 17:27
  • The example on pages 278-9 of the Second Edition of ISLR (Section 6.5.2) illustrates use of cv.glmnet() to choose the penalty lambda that minimizes cross-validated error. – EdM May 23 '23 at 17:37
  • @EdM sure, but the entire point of me running these benchmark algorithms is to compare their performance with that of the principal author's. Which candidate factors are structural is known by construction for each dataset, so I am not doing any sample splitting or K-fold cross-validation. In practice, I of course would do this. – Marlen May 24 '23 at 01:23
  • A few thoughts. First, your models don't all normalize/standardize the predictors; the default for all 3 functions is to do so, but you chose otherwise for enet. The glmnet function by default chooses a set of 100 lambda values to evaluate, while the others compute the entire sequence of fits at once if p<n (as you seem to have in your data). If your pre-ordained choice of s=.1 is between two of the glmnet choices and is close to the threshold for inclusion/exclusion of a predictor, you could find glmnet including or excluding predictors relative to the others. – EdM May 25 '23 at 14:28
  • @EdM all of this is being run on randomly generated synthetic datasets which were created via Monte Carlo Simulations and by construction, all of the observations in each of them follow the standard normal distribution with standard normal errors. This is not a practical application, this is all for an important section of an academic research paper. Therefore, I can't use many things I normally would, for instance, cross-validation and they are all standardized already. I am going to go back and try them out without explicitly including any s arguments to see what happens right now though. – Marlen May 26 '23 at 15:01
  • @EdM I have been quite busy but I just went back and you were right and I was confused. Pages 278-279 clarified quite a bit in the newer edition of ISLR, but I am still not sure if I will be able to use the lambda = grid argument in my case, will I? – Marlen May 30 '23 at 00:13
  • If there are random numbers involved, then different methods can lead to different results even if the random seed is the same, because the algorithms that generate the results can be different. An example is here: Difference drawing random numbers from distributions R. – Sextus Empiricus May 30 '23 at 15:10

1 Answer


First, as a comment suggested, none of these functions uses random sampling. The data sets I examined all had the number of candidate parameters less than the number of observations. If that's the case, you can get full linear least-squares (LS) regression fits as a reference.

With lars() and the enet() function that's based on it, the functions can return the entire path of solutions from fully penalized (all coefficients 0) to that LS solution. That's explained in Chapter 3 of Elements of Statistical Learning. When you make predictions from such models with a penalty-factor choice s, you must specify what that penalty means.

You chose s = 0.1 and mode = "fraction" for those models. According to the lars manual, with that mode:

If mode="fraction", then s should be a number between 0 and 1, and it refers to the ratio of the L1 norm of the coefficient vector, relative to the norm at the full LS solution.

So in those cases you are (arbitrarily) choosing the set of coefficients along the path that has a sum of absolute values equal to one-tenth of the sum of coefficient absolute values at the LS solution. I think that you should get the same results from both types of models. You specified normalize=FALSE in the call to enet() but not in the call to lars(), which might be responsible (perhaps due to numerical precision issues) for the difference you found even if the predictor variables were nominally standardized to start with.
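As a hedged illustration (with df a placeholder for one of your datasets, having response Y and predictor columns X*), this is how the two path algorithms can be put on the same footing and how the "fraction" interpretation of s can be checked against the LS fit:

# Illustrative sketch only; df is a placeholder single dataset.
X <- as.matrix(dplyr::select(df, dplyr::starts_with("X")))
y <- df$Y

# Use the same normalize setting for both path algorithms.
enet_fit <- elasticnet::enet(X, y, lambda = 0, normalize = FALSE)
lars_fit <- lars::lars(X, y, type = "lasso", normalize = FALSE)

b_enet <- predict(enet_fit, s = 0.1, mode = "fraction",
                  type = "coefficients")[["coefficients"]]
b_lars <- predict(lars_fit, s = 0.1, mode = "fraction",
                  type = "coefficients")[["coefficients"]]

# With mode = "fraction", s = 0.1 asks for coefficients whose L1 norm is about
# 10% of the L1 norm at the full least-squares solution:
b_ls <- coef(lm(y ~ X))[-1]
sum(abs(b_lars)) / sum(abs(b_ls))   # should be close to 0.1
all.equal(b_enet, b_lars)           # should agree when the settings match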

The penalty factor in glmnet() is different. With its more general application, that function chooses by default 100 penalty parameter values. The maximum is the smallest penalty that sets all coefficients to 0, and the minimum is a small fraction of it; "the default depends on the sample size nobs relative to the number of variables nvars" according to the manual.

Furthermore, the value of s means something completely different from what you specified for the other functions. Again, quoting from the manual:

Value(s) of the penalty parameter lambda at which predictions are required.

That's the actual penalty applied to the L1 norm of the coefficients, not a fraction of the L1 norm at the LS solution as you specified to the other functions. The value used by predict.glmnet() is equivalent to mode = "penalty" for predict.enet() and mode = "lambda" for predict.lars().
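As a quick hedged illustration of that difference (again with X and y as placeholders for a single dataset):

# Illustrative sketch only; X and y are placeholders for one dataset.
glmnet_fit <- glmnet::glmnet(X, y, alpha = 1)

length(glmnet_fit$lambda)   # the default grid of (up to) 100 lambda values
max(glmnet_fit$lambda)      # smallest penalty that zeroes out every coefficient

# Here s = 0.1 is a raw penalty lambda, interpolated within that grid unless
# exact = TRUE is requested (which then also needs the original x and y):
coef(glmnet_fit, s = 0.1)

# lars()/enet() only become comparable to this via mode = "lambda" and
# mode = "penalty", respectively, not via mode = "fraction".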

Finally, I respectfully suggest that this approach with a fixed choice of s (even when you eventually specify the same type of s to all models) is not consistent with your intent to use LASSO as a "benchmark" against which to compare another predictor-selection method. Choosing an arbitrary penalty factor is not how these functions are used in practice. Typically, one would use cv.glmnet() to find an optimal penalty factor, and return the coefficients that were retained in the model at that optimal penalty.* I think you will find it very hard to convince reviewers that your approach to feature selection with LASSO, with a single fixed value of s, constitutes valid "benchmarking."


*Unlike your code, that would require random sampling and a corresponding set.seed() for reproducibility.
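For concreteness, a minimal sketch of that standard workflow on a single (placeholder) dataset might look like this:

# Illustrative sketch only; X and y are placeholders for one dataset.
set.seed(11)   # cross-validation assigns folds at random, so the seed matters here
cv_fit <- glmnet::cv.glmnet(X, y, alpha = 1, nfolds = 10)

# Coefficients at the cross-validated optimum; s = "lambda.1se" would give the
# sparser one-standard-error model instead.
b_cv <- as.matrix(coef(cv_fit, s = "lambda.min"))
setdiff(rownames(b_cv)[b_cv != 0], "(Intercept)")   # the selected variables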

EdM
  • Well, this is heartbreaking, but I am glad I found out before submitting it for publication! I did all of this a few months ago, but just heard back from the principal author that he wants the results and had to ask this here before sending them to him. I guess I have my work cut out for me once again. – Marlen May 29 '23 at 23:28
  • so should my plan of action be to re-run them all without any s arguments included, and to replace the s argument with cv.glmnet() for glmnet? – Marlen May 29 '23 at 23:42
  • @Marlen yes to make this compatible with standard LASSO practice, run cv.glmnet() to find an appropriate penalty value (lambda.min would be the most appropriate choice here, I think) then use that as the penalty factor on the corresponding glmnet() fit. It's just a few extra lines of code, for which you can use the LASSO example in Section 6.5.2 of ISLR as a guide. I suppose you could also use lambda.1se if your interest is in minimizing the number of non-zero coefficients; evaluating both wouldn't appreciably extend computation time. – EdM May 30 '23 at 02:32
  • I have done this and started re-running all of my regressions 2 hours ago, but their performance is actually much worse now, which doesn't make any sense. – Marlen May 30 '23 at 03:10
  • Should I just ask about this as a new question at this point? – Marlen May 30 '23 at 03:30
  • @Marlen I would recommend a new question showing one or two reproducible examples of what you've done, illustrating the poorer performance with the new approach. For example, I couldn't find the "dataset 0.25-3-1-1" at all in your repository. – EdM May 30 '23 at 11:58
  • Oh of course! Yes, I didn't upload all 50k, I only uploaded the first 50 and the last 50, but duh, I should also upload that dataset now that I have referenced it here. It is in the Repo now. – Marlen May 30 '23 at 13:57