
This question is a follow-up to this prior question of mine from a week ago, which was itself a follow-up to an earlier question. Here is a link to the GitHub repository for the project.

I am running N LASSO regressions, one for each of the N datasets I have in a file folder, to serve as one of several benchmark variable selection algorithms whose performance can be compared with that of a novel variable selection algorithm being proposed by the principal author of the paper we are collaborating on right now. I originally ran my N LASSOs via the enet function, but when I tried to replicate my findings via glmnet and lars, each selected different variables. The way I originally set up my R code to run them using the glmnet function fundamentally misused the s argument of the coef() function in the middle of the following code block, because I hardcoded it, and did so seemingly arbitrarily:

set.seed(11)     
glmnet_lasso.fits <- lapply(datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 1))

# Store and print out all of the regression equation
# specifications selected by LASSO when called
lasso.coeffs = glmnet_lasso.fits |> Map(f = \(model) coef(model, s = .1))

Variables.Selected <- lasso.coeffs |> Map(f = \(matr) matr |> as.matrix() |> as.data.frame() |> filter(s1 != 0) |> rownames())

Variables.Selected = lapply(seq_along(datasets), \(j) Variables.Selected[[j]][-1])

Variables.Not.Selected <- lasso.coeffs |> Map(f = \(matr) matr |> as.matrix() |> as.data.frame() |> filter(s1 == 0) |> rownames())

write.csv(data.frame(DS_name = DS_names_list, 
                     Variables.Selected.by.glmnet = sapply(Variables.Selected, toString), 
                     Structural_Variables = sapply(Structural_Variables, toString), 
                     Variables.Not.Selected.by.glmnet = sapply(Variables.Not.Selected, toString), 
                     Nonstructural_Variables = sapply(Nonstructural_Variables, toString)), 
          file = "LASSO's Selections via glmnet for the DSs from '0.75-11-1-1 to 0.75-11-10-500'.csv", 
          row.names = FALSE)

The reason I accidentally set s = 0.1 (i.e., lambda = 0.1) for all of the LASSOs is that I was using both lars and glmnet to try to replicate/reproduce the N LASSO regressions I had already run on my N data.tables stored in the Large List object 'datasets', in that order. Because I was trying to precisely replicate the regression equations selected for each dataset, I set s = 0.1 in both of them without checking whether the s argument means the same thing in each package as it does in the code above, which comes from the R script "glmnet LASSO (Regressions only)", located in the Stage 2 scripts folder of the GitHub repo.
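To make the mismatch concrete, here is a minimal sketch on a single hypothetical dataset d (the variable names and the mode argument are my illustrative assumptions, not code from the repo) of why the two s arguments are not interchangeable:

library(glmnet)
library(lars)
X <- as.matrix(d[, grep("^X", names(d))])
Y <- d$Y

g <- glmnet(X, Y, alpha = 1)
coef(g, s = 0.1)                     # in glmnet, s is the lambda value itself

l <- lars(X, Y, type = "lasso")
coef(l, s = 0.1, mode = "lambda")    # in lars, s only means lambda when mode = "lambda";
                                     # with the default mode, s = 0.1 indexes the path differently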

When I began to implement the suggested alternative way of selecting the lambda for each dataset, namely via cross-validation, from the answer to my aforementioned previous question about this issue, which I posted here (fitting, given that the name of this forum is Cross Validated), I initially thought it was going to perform better overall, i.e. obtain higher performance metrics and select more "correctly specified" regression equations, across all of my 260,000 synthetic sample datasets, simply because it performed SIGNIFICANTLY better on the first subset of 10,000 of those datasets I ran it on (see the two screenshots below the updated code block below this paragraph). I naively extrapolated from there in my head without checking again until I finished re-running all of them half a week later.

Here is the code that selects the lambda for each LASSO via cross-validation rather than just setting it equal to 0.1 at the outset:

# This function fits all n LASSO regressions for/on
# each of the corresponding n datasets stored in the object
# of that name, then outputs the standard regression results
# typically returned for any regression run using R
set.seed(11)     # to ensure replicability
system.time(glmnet_lasso.fits <- lapply(X = datasets, function(i) 
               glmnet(x = as.matrix(select(i, starts_with("X"))), 
                      y = i$Y, alpha = 1)))

# Use cross-validation to select the penalty parameter
system.time(cv_glmnet_lasso.fits <- lapply(datasets, function(i) 
               cv.glmnet(x = as.matrix(select(i, starts_with("X"))), 
                         y = i$Y, alpha = 1)))

# Store and print out the regression equation specifications selected by LASSO when called
lasso.coeffs = cv_glmnet_lasso.fits |> Map(f = \(model) coef(model, s = "lambda.1se"))

Variables.Selected <- lasso.coeffs |> Map(f = \(matr) matr |> as.matrix() |> as.data.frame() |> filter(s1 != 0) |> rownames())

Variables.Selected = lapply(seq_along(datasets), \(j) Variables.Selected[[j]][-1])

Variables.Not.Selected <- lasso.coeffs |> Map(f = \(matr) matr |> as.matrix() |> as.data.frame() |> filter(s1 == 0) |> rownames())

write.csv(data.frame(DS_name = DS_names_list, 
                     Variables.Selected.by.glmnet = sapply(Variables.Selected, toString), 
                     Structural_Variables = sapply(Structural_Variables, toString), 
                     Variables.Not.Selected.by.glmnet = sapply(Variables.Not.Selected, toString), 
                     Nonstructural_Variables = sapply(Nonstructural_Variables, toString)), 
          file = "LASSO's Selections via glmnet for the DSs from '0.75-11-1-1 to 0.75-11-10-500'.csv", 
          row.names = FALSE)
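As a quick diagnostic (written against the cv_glmnet_lasso.fits object above; the comparison against 0.1 is just illustrative), the cross-validated penalties can be pulled out and compared with the hard-coded value directly:

cv_lambdas <- sapply(cv_glmnet_lasso.fits, function(fit) fit$lambda.1se)
summary(cv_lambdas)      # distribution of the CV-chosen penalties across datasets
mean(cv_lambdas < 0.1)   # share of datasets where lambda.1se is a weaker penalty than 0.1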

And here are screenshots showing the mean classification performance metrics for the N glmnet LASSOs estimated on the same N datasets, using lambda = 0.1 for every one of them vs. a lambda value chosen via cross-validation for each dataset. I interpret their performance in terms of how good they are at selecting the structural model, so I calculate the Accuracy, F1 Score, True Positive Rate (sensitivity), True Negative Rate (specificity), and False Positive Rate for each selected model, then use them to determine whether the model selected is underspecified (i.e., it suffers from omitted variable bias), correctly specified (self-explanatory), or overspecified (i.e., it includes all structural factors plus at least one extraneous, nonstructural variable). The Excel workbook contains the mean values of all of these performance metrics over all N datasets and the counts of under-, correctly, and over-specified models selected.
glmnet(x, y, alpha = 1) + coef(s = 0.1): [two screenshots of the mean performance metrics and specification counts]

cv.glmnet(x, y, alpha = 1) + coef(s = "lambda.1se"): [two screenshots of the mean performance metrics and specification counts]

Unfortunately, this is not a mixed story where, for instance, some of the performance metrics are better with s = 0.1 than with s = lambda.1se and some are worse; the selections perform better across the board when s = 0.1.

How should I interpret this result? Where I'm from, correctly specifying 56,229 models out of 260,000 (about 22%) is quite a bit different from 38,764 out of 260,000 (about 15%). Should I try running it all again with s = 0.15, or s = 0.2, 0.25, or some other value?
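If I do try other fixed values, a rough sketch like the following (reusing the glmnet_lasso.fits object from the first code block; the grid values are arbitrary) would let me evaluate several penalties from the fits I already have, without refitting anything:

lambda_grid <- c(0.05, 0.1, 0.15, 0.2, 0.25)   # illustrative values only
selections_by_lambda <- lapply(glmnet_lasso.fits, function(fit)
  lapply(lambda_grid, function(lam) {
    cf <- as.matrix(coef(fit, s = lam))
    setdiff(rownames(cf)[cf[, 1] != 0], "(Intercept)")   # names of the selected variables
  }))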

p.s. This is the actual description, written by its author, of the custom macro used to create all N synthetic datasets:

  1. Construct a random data set composed of N regressors, each distributed standard normal, and standard normal error. Construct a dependent variable as a linear function of K of the regressors and the error.
  2. Run ERR, stepwise, (whatever other methods) on the data set in an attempt to find the “best fit” model (according to each procedures’ criterion for “best”).
  3. Repeat 1-2 10,000 times.
  4. Repeat 1-3 with error variances 2, 3, 4, 5, 6, 7, 8.
  5. Repeat 1-4 with different values for K.
  6. Repeat 1-5, but this time create the N regressors with randomly selected cross-regressor correlations.

Filename: D-C-B-A.xlsx
  D = Correlation between regressors x 10 (e.g., 1 = 0.1 correlation)
  C = # of regressors (k)
  B = Error variance
  A = data set (1 through 10,000)
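For readers who want to see step 1 in code, here is a rough sketch of one cell of this design; the sample size, unit coefficients, absence of cross-regressor correlation, and random placement of the structural regressors are simplifying assumptions of mine, not details of the actual macro:

set.seed(1)
n <- 500; p <- 30; K <- 11; err_sd <- 1              # illustrative sizes only
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("X", 1:p))) # standard normal regressors
structural <- sample(1:p, K)                         # which K regressors are structural
Y <- drop(X[, structural] %*% rep(1, K) + rnorm(n, sd = err_sd))
d <- data.frame(Y = Y, X)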

p.s.2. The performance metrics in the csv files were calculated using the rest of the R script in question, which I include here for completeness; rest assured, they have already been independently verified many times over by now:

### Count up how many Variables Selected match  the true 
### structural equation variables for that dataset in order
### to measure LASSO's performance.
# all of the "Positives", i.e. all the Structural Regressors
BM1_NPs <- lapply(Structural_Variables, function(i) { length(i) })

all of the "Negatives", i.e. all the Nonstructural Regressors

BM1_NNs <- lapply(Structural_Variables, function(i) {30 - length(i)})

all the "True Positives"

BM1_TPs <- lapply(seq_along(datasets), (i) sum(Variables.Selected[[i]] %in% Structural_Variables[[i]]))

the number True Negatives for each LASSO

BM1_TNs <- lapply(seq_along(datasets), (k) sum(Variables.Not.Selected[[k]] %in% Nonstructural_Variables[[k]]))

the number of False Positives

BM1_FPs <- lapply(seq_along(datasets), (i) sum(Variables.Selected[[i]] %in% Nonstructural_Variables[[i]]))

the number of False Negatives Selected by each LASSO

BM1_FNs <- lapply(seq_along(datasets), (i) sum(Variables.Not.Selected[[i]] %in% Structural_Variables[[i]]))

the True Positive Rate

BM1_TPR = lapply(seq_along(datasets), (j) j <- (BM1_TPs[[j]]/BM1_NPs[[j]]) )

the False Positive Rate = FP/(FP + TN)

BM1_FPR = lapply(seq_along(datasets), (j) j <- (BM1_FPs[[j]])/(BM1_FPs[[j]] + BM1_TNs[[j]]))

the True Negative Rate = TN/(FP + TN)

BM1_TNR <- lapply(seq_along(datasets), (w) w <- (BM1_TNs[[w]]/(BM1_FPs[[w]] + BM1_TNs[[w]])))

calculate the accuracy and F1 score with help from GPT 4

BM1_Accuracy <- lapply(seq_along(datasets), function(i) (BM1_TPs[[i]] + BM1_TNs[[i]])/(BM1_TPs[[i]] + BM1_TNs[[i]] + BM1_FPs[[i]] + BM1_FNs[[i]]))

First calculate precision and TPR for each dataset

BM1_Precision <- lapply(seq_along(datasets), function(i) BM1_TPs[[i]]/(BM1_TPs[[i]] + BM1_FPs[[i]]))

Then calculate F1 score for each dataset

BM1_F1_Score <- lapply(seq_along(datasets), function(i) 2 * (BM1_Precision[[i]] * BM1_TPR[[i]])/(BM1_Precision[[i]] + BM1_TPR[[i]]))

### Determine whether each selected model is "Underspecified",
### "Correctly Specified", or "Overspecified".

# True Positive Rates as a vector rather than a list
TPR <- unlist(BM1_TPR)
mean_TPR <- round(mean(TPR), 3)

# number of selected regressions with at least one omitted variable
num_OMVs <- sum(TPR < 1, na.rm = TRUE)

# False Positive Rates as a vector rather than a list
FPR <- unlist(BM1_FPR)
mean_FPR <- round(mean(FPR), 3)
num_null_FPR <- sum(FPR == 0, na.rm = TRUE)

# number of models with at least one extraneous variable selected
num_Extraneous <- sum(FPR > 0, na.rm = TRUE)

# True Negative Rates as a vector rather than a list
TNR <- unlist(BM1_TNR)
mean_TNR <- round(mean(TNR), 3)

# the Accuracy as a vector rather than a list
Acc <- unlist(BM1_Accuracy)
mean_Accuracy <- round(mean(Acc), 3)

# the F1 Scores as a vector rather than a list
F1 <- unlist(BM1_F1_Score)
mean_F1 <- round(mean(F1), 3)

# Number of Underspecified Regression Specifications Selected by LASSO
N_Under = sum( (TPR < 1) & (FPR == 0) )

# Number of Correctly Specified Regressions Selected by LASSO
N_Correct <- sum( (TPR == 1) & (TNR == 1) )

# Number of Overspecified Regression Specifications Selected by LASSO
N_Over = sum( (TPR == 1) & (FPR > 0) )

# sum of all 3 specification categories
Un_Corr_Ov = N_Under + N_Correct + N_Over

Headers <- c("Mean Accuracy", "Mean F1 Score", "Mean True Positive Rate", "Mean True Negative Rate", "Mean False Positive Rate")
PMs1 <- data.frame(mean_Accuracy, mean_F1, mean_TPR, mean_TNR, mean_FPR)
colnames(PMs1) <- Headers

Headers <- c("Underspecified Models Selected", "Correctly Specified Models Selected", "Overspecified Models Selected")
PMs2 <- data.frame(N_Under, N_Correct, N_Over)
colnames(PMs2) <- Headers

Headers <- c("All Correct, Over, and Underspecified Models", "Models with at least one Omitted Variable", "Models with at least one Extra Variable")
PMs3 <- data.frame(Un_Corr_Ov, num_OMVs, num_Extraneous)
colnames(PMs3) <- Headers

# Or, just print out this one data frame instead of having to print out 3 different things
performance_metrics <- data.frame(PMs1, PMs2, PMs3)

write.csv(performance_metrics, file = "glmnet's Performance on the datasets from '0.75-11-1-1 to 0.75-11-10-500'.csv", row.names = FALSE)

Marlen
    is cv going to be first author on this paper ;)? – John Madden Jun 05 '23 at 12:25
  • No, we'll definitely mention him in a footnote though haha. – Marlen Jun 05 '23 at 23:57
  • Am I understanding the results correctly: all the returned models at lambda.1se and at s=0.1 actually included all of the true "structural" variables, and the differences are in how many more variables were returned at lambda.1se? – EdM Jun 06 '23 at 13:29
  • There don't seem to be many high pairwise correlations among the predictors, even at the highest correlation setting of 0.75. I've edited the answer to demonstrate that and to show the glmnet plot for an allegedly high-correlation data set. – EdM Jun 10 '23 at 17:19

1 Answer

The nature of the synthetic data set:

Each dataset name is of the general form n1-N2-N3-N4.csv, where n1 is one of the following 4 levels of multicollinearity among the set of all true regressors (structural variables/factors) for that dataset: 0, 0.25, 0.5, 0.7; N2 is the number of structural variables in that dataset ... which can be any one of 13 different possible values from 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15; N3 is the Error Variance, which is one of the 10 following monotonically increasing (integer) values: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; ...

probably explains this behavior.

Although it's now clarified that both "true" predictors and "extra" predictors within any data set can be correlated, the correlations might not be very high. That means a particular (and maybe not very realistic) sparsity structure is imposed from the beginning. In real-world data, high correlations between "true" outcome-associated predictors and "extra" predictors might be expected. In that case, it can be hard to distinguish which retained predictors have "true" associations with outcome and which are maintained just due to correlations with other predictors.
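As a purely illustrative sketch (not drawn from these synthetic data), this is the kind of situation meant here: a "true" and an "extra" predictor correlated strongly enough that LASSO could plausibly retain either one:

library(MASS)
set.seed(1)
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)     # correlation of 0.9 between the two predictors
X <- mvrnorm(500, mu = c(0, 0), Sigma = Sigma)
y <- X[, 1] + rnorm(500)                     # only the first predictor is "true"
cor(X[, 1], X[, 2])                          # the second is an "extra" that largely mimics it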

These synthetic data don't seem to have very high correlations, however. Here's an example from data set 0.75-15-10-450.csv, which based on the above should have the highest correlations (0.75), the largest number of "true" predictors (15), and the highest error variance (10).

csv1 <- read.csv("~/Downloads/0.75-15-10-450.csv",skip=2)
cormat <- cor(csv1[,2:31]) ## correlation matrix among predictors
plot(density(cormat[upper.tri(cormat)])) ## all pairwise correlations

[figure: distribution of pairwise correlations]

Most of the pairwise correlations are close to 0. Even at this highest correlation setting, most of the variance inflation factors aren't terribly high:

car::vif(lm(Y~.,data=csv1))
#        X1        X2        X3        X4        X5        X6        X7        X8 
# 11.428216 17.928297 11.946471  8.439049  6.370595  6.043061  4.610803  4.821657 
#        X9       X10       X11       X12       X13       X14       X15       X16 
#  4.139390  4.221097  3.774396  4.057433  4.702758  4.070381  3.585803  3.946377 
#       X17       X18       X19       X20       X21       X22       X23       X24 
#  4.314557  4.284794  4.159973  4.214894  4.173159  4.010739  4.308400  4.280168 
#       X25       X26       X27       X28       X29       X30 
#  4.079863  3.895387  3.972616  4.161915  3.874154  2.533185 

In this situation, LASSO should readily be able to penalize the "extra" predictors preferentially. With this type of synthetic data, if you start from the maximum penalty and decrease it systematically, I suspect that most or all of the "true" predictors will be added to the model before any of the "extra" predictors. In that case a penalty even higher than what would be imposed at lambda.1se is likely to do the best at keeping all the "true" predictors in the model while omitting almost all of the "extra" ones.

Here are examples from extremes of the combinations of correlations, numbers of "true" predictors, and error variances. The 0-3-1-1.csv file is near the lowest complexity, 0.75-15-10-450.csv is near the highest.

In the following plots, the penalty is decreasing continuously from left to right, showing the corresponding changes in coefficient values (left axis) and the number of retained coefficients on the top.

For 0-3-1-1.csv:

[figure: predictor entry plot for the simple case]

The file name indicates 3 "true" predictors, all of which enter the model almost immediately as the penalty is relaxed. There is an enormous range of penalties within which the same predictors would be retained (with progressively higher coefficient values) before any of the "extra" predictors are included.

For 0.75-15-10-450.csv:

[figure: predictor entry plot for the complex case]

If you investigate the details of the fit as a function of penalty, you will see at the lowest penalty values that all 15 "true" predictors have coefficient values greater than about 0.5 while those of all "extra" predictors are about 0.1 or less. Although at very high penalties an "extra" predictor (predictor 8) entered the model very early, its coefficient magnitude decreased as more "true" predictors entered. One "true" predictor (predictor 30) entered fairly late as the penalty was decreased, but it eventually had the highest of all coefficient values.
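One way to investigate those details yourself, reusing the csv1 object read in above, is to tabulate when each predictor enters the path and what its coefficient is at the smallest penalty (a sketch, not the only way to do it):

library(glmnet)
fit <- glmnet(x = as.matrix(csv1[, 2:31]), y = csv1$Y, alpha = 1)
entry_step <- apply(as.matrix(fit$beta) != 0, 1, function(z) which(z)[1])  # first lambda index with a nonzero coefficient
sort(entry_step)                                  # smaller index = enters at a higher penalty
round(as.matrix(fit$beta)[, ncol(fit$beta)], 2)   # coefficient values at the smallest penalty on the path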

It doesn't seem that these synthetic data pose much of a challenge to the ability of LASSO to include all of the "true" predictors. The question is how many "extra" predictors are also included at any choice of penalty.

This plot highlights an additional problem: focusing on how many predictors are retained omits consideration of the associated coefficient values. When "extra" predictors do eventually get included in the model, their coefficients are low in magnitude and won't much affect predictions made from the model.

The above is completely consistent with the tabulated results you show. As I read the tables, all models managed to retain all "true" predictors. That's pretty remarkable. The differences are in how many "extra" predictors were also retained. With this type of data, the penalty at s = 0.1, presumably higher than the cross-validated choice of lambda.1se, removes even more of the "extra" predictors.

Finally, I'd recommend against using the fraction of "true" models as a metric. LASSO can't really be expected to find a "true" model, just one that works well enough based on its "bet on sparsity" (SLS, page 2). For distinguishing the ability to omit "extra" predictors you might be better off using something like your "false positive" metric, which at first glance might have some useful associations with one or more of multicollinearity, number of "true" variables, and error variance. Maybe even better, compare the L1 norm of the "true" predictors in a model against that of the retained "extra" predictors to get a better estimate of the practical significance of including the "extra" predictors.
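A rough sketch of that L1-norm comparison, again using csv1 and assuming a character vector true_vars that holds the names of the structural predictors for this data set (for example, taken from the Structural_Variables list in the question's code):

library(glmnet)
cvfit <- cv.glmnet(x = as.matrix(csv1[, 2:31]), y = csv1$Y, alpha = 1)
cf <- as.matrix(coef(cvfit, s = "lambda.1se"))[-1, 1]    # drop the intercept
c(true_L1  = sum(abs(cf[true_vars])),
  extra_L1 = sum(abs(cf[setdiff(names(cf), true_vars)])))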

Code for first glmnet plot

library(glmnet)
lDat <- read.csv("~/Downloads/0-3-1-1.csv",skip=2)
glmnet1 <- glmnet(x=as.matrix(lDat[2:31]),y=lDat$Y)
plot(glmnet1,label=TRUE)

EdM
  • Thank you for answering again. Just to be as careful as possible, since my rephrasing of the procedure used to create the datasets may not be right, here is the description written by the principal researcher who created them:
    1. Construct a random data set composed of N regressors, each distributed standard normal, and standard normal error. Construct a dependent variable as a linear function of K of the regressors and the error
    2. Run ERR, stepwise, (whatever other methods) on the data set in an attempt to find the “best fit” model (according to each procedures’ criterion for “best”)
    – Marlen Jun 09 '23 at 20:45
  • ... continued: 3. Repeat 1-2 500 times. 4. Repeat 1-3 with error variances 2, 3, 4, 5, 6, 7, 8. 5. Repeat 1-4 with different values for K. 6. Repeat 1-5, but this time create the N regressors with randomly selected cross-regressor correlations.

    Filename: D-C-B-A.xlsx D = Correlation between regressors x 10 (e.g., 1 = 0.1 correlation) C = # of regressors (k) B = Error variance A = data set (1 through 500)

    – Marlen Jun 09 '23 at 20:45
  • @Marlen the question is whether the "randomly selected cross-regressor correlations" were restricted to the "true" predictors or allowed among all 30 potential predictors. The problem will be more dramatic if they were restricted to the "true" predictors, but I suspect that it will still be a problem if all 30 predictors are allowed to be correlated. You can check by examining a few representative data sets. The fact that all "true" predictors were returned in all 260,000 models means that the only issue is how many "extra" predictors happen to be retained by the different methods and penalty choices. – EdM Jun 10 '23 at 02:39
  • Excellent, I thought that's what you meant, but I wanted to double-check with you first before asking the principal author. I understand your critique that ALL 260k models being either correctly specified or overspecified is a big initial red flag; it was for me too back in October. But a similar result happened with either Backward or Forward Stepwise Regression, or possibly even both, and at that point I just thought I didn't understand something. – Marlen Jun 10 '23 at 03:23
  • I got a reply, here is what he said "The regressors, all 30 of them, are pair-wise correlated. IOW, true regressors are correlated with each other, true regressors are correlated with false regressors, and false regressors are correlated with each other." To me, this sounds like there is cross correlation among all 30 candidate regressors, but it is pair-wise. – Marlen Jun 10 '23 at 13:38
    @Marlen even at the highest correlations, and even with all predictors intercorrelated, there don't seem to be very high correlations. For example, in 0.75-15-10-450.csv (which has the highest correlations, "true" predictor numbers and error variance), the highest variance inflation factor (VIF) is about 18, and only 3 out of 30 predictors have a VIF greater than 10. If you apply my code to that data set, at the lowest penalty levels all of the "true" predictors have coefficient values greater than about 0.5 while those of all the "extra" predictors are about 0.1 or less. – EdM Jun 10 '23 at 15:43
  • Ahh, I am finally seeing the big picture of your feedback now after going back and re-reading the original LASSO paper last night. I also noticed how alarmingly well LASSO was performing back when I first got my code to run in an earlier phase of this work last October, on a smaller set of 58.5k synthetic datasets that were constructed differently. For those, the 'true' regressors were always the first ones, which unfairly suits Forward Stepwise, another one of the benchmark methods, so at least the current data doesn't have that problem anymore. – Marlen Jun 12 '23 at 17:07
  • But the issue is this: lasso is just one of the three benchmark variable selection algorithms I am running so that the principal author has something to compare the performance of the novel variable selection algorithm he is proposing against. I have not had any issues like this with the other two benchmarks, Backward Elimination Stepwise and Forward Selection Stepwise Regression, so it would be awkward or incoherent to change how I measure the success of lasso. – Marlen Jun 12 '23 at 17:12
    @Marlen forward stepwise shouldn't depend on the order of presentation of predictors to the model, just their individual associations with outcome. The difference is that stepwise methods have fixed rules for stopping (typically based on AIC), while LASSO allows for a wide range of penalty choices whose values will determine its "success," however measured. I'm not sure that the fraction of "true" models is appropriate for any of the approaches, including for the "novel algorithm," but if that's how the "novel algorithm" is being assessed I guess that you are stuck with it. – EdM Jun 12 '23 at 17:58
  • Based on some of your feedback, I went back and re-read the original LASSO paper; in my physical copy this is on page 19, but at https://webdoc.agsci.colostate.edu/koontz/arec-econ535/papers/Tibshirani%20(JRSS-B%201996).pdf it is page 17 (or 282). In the first benchmark comparison example, the author explicitly compares lasso's performance with best subset selection and others in very similar terms to what we are doing: "although the correct model (1, 2, 5) was chosen only 2.5% of the time, the selected model contained (1, 2, 5) 95.5% of the time." – Marlen Jun 18 '23 at 18:24
  • However, I have gone back and added in extra lines of code to calculate more performance metrics than just the TPR, TNR, and FPR as I originally had. Now I also calculate the Accuracy, F1 Score, Positive Predictive Value, and False Negative Rate for each model as well. – Marlen Jun 18 '23 at 18:28