
I'm currently studying Lasso regression to use it as a variable selection method.

[Figure: lasso coefficient paths plotted against log lambda; X1 is the black line, with red, blue, pink, and green paths for the other variables.]

When I get a plot like the one above, shouldn't X1 be the most important variable, since it is the last variable to shrink to zero?

What confuses me is that if the optimal log lambda is -4, X1 (the black line) doesn't seem important compared to the red, blue, pink, and green lines, which sit above it.

So when assessing variable importance, should I look at the order in which variables shrink to zero, or at the coefficient values at the optimal log lambda?

Kevin
  • How are you thinking of "importance"? What does it mean for a variable to be important to you? – Demetri Pananos Oct 22 '23 at 22:31
  • @DemetriPananos I'm thinking of importance as how much a variable contributes to minimizing prediction error. – Kevin Oct 22 '23 at 22:58
  • If that is the case, shouldn't we be examining the reduction in loss too, and not just when variables enter/exit the model? – Demetri Pananos Oct 22 '23 at 23:31
  • It is difficult to be certain, but (assuming all the variables have zero mean and the same variance) perhaps $x_1$ is negatively correlated with $x_7$ and correlated with some others, and when the penalty/shrinkage is slight (small $\lambda$) the others contribute more to a better fit, possibly sometimes offsetting each other, while when the penalty/shrinkage is high, just using $x_1$ is the best you can do. There need not be a variable that is "most important" in all cases, and as you change constraints of the model, different variables get different coefficients. – Henry Oct 23 '23 at 00:02

2 Answers


When each variable is used on its own, X1 is the best predictor, but that is partly because it correlates with a combination of other variables that predicts the data even better. For smaller $\lambda$ that combination is included in the model, and so the coefficient of X1 falls. X7 might have the largest adjusted effect, yet work well as a predictor only in the context of the other variables. Depending on the correlations among the covariates, they might explain very little variation on their own; see these questions:

How can delta (Δ) R² for a term be larger than R² for that term?

Should individual $R^2$ of a predictor always be greater than $\Delta R^2$ when removing that predictor from an expanded model?

I have constructed a simple example in R, where x1 is only a proxy for the true predictor x2 - x3:

n <- 100
x2 <- rnorm(n)
x3 <- rnorm(n)
# x2 - x3 just to separate them in the plot
y <- x2 - x3 + rnorm(n)
x1 <- (x2 - x3) + rnorm(n, sd = 0.5)
anova(lm(y ~ x1), lm(y ~ x2), lm(y ~ x3)) # lower RSS -> better R2 than the other two models

library(glmnet)
mod <- glmnet(cbind(x1, x2, x3), y, alpha = 1)
plot(mod, xvar = "lambda", xlim = c(-3, 1))
# x1 is initially the only variable, then drops to 0 as x2, x3 get included
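
To connect this to the question about the optimal $\lambda$: a minimal sketch, continuing the simulated data above and assuming $\lambda$ is chosen by cross-validation with cv.glmnet's default deviance criterion, of reading coefficients at the selected value:

cv <- cv.glmnet(cbind(x1, x2, x3), y, alpha = 1)
coef(cv, s = "lambda.min")  # coefficients at the CV-optimal lambda
coef(cv, s = "lambda.1se")  # a sparser, more conservative choice
# At small lambda x2 and x3 carry the weight and x1 is near 0, so the
# "last to shrink to zero" ordering and the coefficient sizes at the
# optimal lambda can disagree.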

Lukas Lohse

Putting aside the fact that lasso has a low probability of finding the right variables, lasso is a selection/estimation technique where the stopping rule is typically based on optimizing some sensitive criterion such as deviance in a cross-validation procedure. The stopping rule dictates which variables have coefficients set to zero. You can say that those dropped variables have importance zero. If the variables are all reasonably standardized, the importance of the selected variables will be non-zero and could be guesstimated by the absolute value of their coefficients. It is important to recognize the difficulty of the selection and estimation tasks by bootstrapping the whole process to put confidence intervals on such variable importance estimates.
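
A minimal sketch of that bootstrap idea, reusing the kind of simulated data from the other answer and assuming glmnet/cv.glmnet with the deviance-based lambda.min rule; the 200 resamples and percentile intervals are arbitrary illustrative choices:

library(glmnet)
set.seed(1)
n <- 100
x2 <- rnorm(n)
x3 <- rnorm(n)
x1 <- (x2 - x3) + rnorm(n, sd = 0.5)
y  <- x2 - x3 + rnorm(n)
X  <- cbind(x1, x2, x3)

# Re-run the entire selection/estimation pipeline on each bootstrap resample.
boot_coefs <- replicate(200, {
  i  <- sample(n, replace = TRUE)
  cv <- cv.glmnet(X[i, ], y[i], alpha = 1)
  as.matrix(coef(cv, s = "lambda.min"))[-1, 1]  # drop the intercept
})

# Percentile intervals per coefficient; intervals containing many exact
# zeros reflect the instability of the selection step itself.
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))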

Frank Harrell
  • "It is important to recognize the difficulty of the selection and estimation tasks by bootstrapping the whole process to put confidence intervals on such variable importance estimates." It amazes me how so many applied fields will insist on p-values or confidence intervals, only to forget about uncertainty when machine learning calculates feature importance. If you wouldn't accept a mean without some kind of test or interval estimate, why is feature importance exempt from this? – Dave Oct 24 '23 at 16:50