I'm just starting out in machine learning and I can't figure it out: how does the lasso method find which features are redundant and shrink their coefficients to zero?
-
Possible duplicate of Why L1 norm for sparse models – Karolis Koncevičius Aug 10 '18 at 18:07
-
You're mistaken – lasso doesn't find the redundant features; it just shrinks features to zero somewhat arbitrarily. You need to use another method like cross validation or scoring to determine which features are best – Xavier Bourret Sicotte Aug 25 '18 at 09:59
2 Answers
There are many ways to think about regularization. I find the restricted optimization formulation to be quite intuitive.
$$\hat\beta = \arg\min_\beta \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|$$ $$\text{subject to } \|\boldsymbol\beta\| \leq \lambda_\star$$
Usually in the first line, we use the L2 norm (squared), which corresponds to Ordinary Least Squares estimates. The restriction gives a way of "shrinking" the estimates back to the origin.
If an L2 norm is used in the second line, we have Ridge Regression, which effectively pulls the OLS estimates back towards the origin. Useful in many situations, but not very good at setting estimates equal to 0.
If an L1 norm is used in the second line, we get LASSO. This is good for "feature selection" since it is able to effectively set inert or nearly-inert features to 0.
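For reference, this constrained LASSO problem is usually solved in its equivalent penalized (Lagrangian) form, which is what most software actually implements; each penalty weight $\lambda$ corresponds to some bound $\lambda_\star$ in the constraint above:
$$\hat\beta = \arg\min_\beta \|\mathbf{y} - \mathbf{X}\boldsymbol\beta\|_2^2 + \lambda\|\boldsymbol\beta\|_1$$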
The following image illustrates this nicely in two dimensions.
Figure from Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
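As a minimal illustration of the difference (the data and penalty strengths below are my own arbitrary choices, not from the book), fitting both estimators with scikit-learn on data where most features are inert typically leaves every Ridge coefficient nonzero while LASSO sets several of them to exactly zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 200, 10

# Only the first 3 of the 10 features actually influence y; the rest are inert.
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ true_beta + rng.normal(scale=1.0, size=n)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks, but rarely hits exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty: can set inert features to exactly zero

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("LASSO coefficients:", np.round(lasso.coef_, 3))
print("Exact zeros (Ridge):", int(np.sum(ridge.coef_ == 0)))
print("Exact zeros (LASSO):", int(np.sum(lasso.coef_ == 0)))
```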
-
@knrumsey How do we know that the ellipses have these orientations? I mean, we can imagine ellipses that hit the circle (ridge case) at the point $(0, 1)$, and as such ridge would also perform feature selection. In other words, can we show in general that, whatever the solution $\hat{\boldsymbol{\beta}}$ of (unpenalized) least squares is, the contours are oriented so that they hit a corner in the lasso case? – ado sar Feb 22 '23 at 11:29
-
Also, how does the intercept affect the ellipses? The least squares solution $\hat{\boldsymbol{\beta}}$ is equal to $(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$, not just $(\hat{\beta}_1, \hat{\beta}_2)$. – ado sar Feb 24 '23 at 17:34
-
@adosar 1. The ellipses' orientation roughly depicts the covariance of the sampling distribution of $\hat\beta$. 2. Generally, the only way that this can happen is if $\hat\beta_2$ is already $0$. It can be shrunk to nearly zero (as seen in the figure), but won't be exactly zero. You should be able to convince yourself of this geometrically. 3. It is common in the regularization literature to first center the variables (mean zero) to enforce $\hat\beta_0 = 0$. – knrumsey Feb 28 '23 at 19:23
How does the lasso method find which features are redundant and shrink their coefficients to zero?
The Lasso method on its own does not find which features to shrink. Instead, it is the combination of Lasso and cross validation (CV) that allows us to determine the best Lasso parameter $\lambda$.
Different values of $\lambda$ may shrink different parameters to zero, and choosing the right $\lambda$ can sometimes be easy (as in the graphic below) and sometimes be a difficult, subjective task.
A good introduction to combining Lasso and cross validation is provided by the inventor of the Lasso, Robert Tibshirani, on pages 15 and 16 here. See also: Lasso cross validation, and here.
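A hedged sketch of that workflow with scikit-learn (the dataset and settings below are illustrative assumptions, not from Tibshirani's slides): `LassoCV` runs cross validation over a grid of penalty values (called `alpha` there rather than $\lambda$) and keeps the best one:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: 20 features, only 5 of them informative.
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)

# LassoCV tries a grid of alphas, scores each by 5-fold cross validation,
# then refits on the full data with the best alpha.
model = LassoCV(cv=5, random_state=0).fit(X, y)

print("Chosen alpha (lambda):", model.alpha_)
print("Nonzero coefficients:", int(np.sum(model.coef_ != 0)), "out of", X.shape[1])
```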
What happens when features are correlated
The case of correlated features and Lasso is more subtle and is well explained here: How does LASSO select among collinear predictors?.
But in practice there is some arbitrariness in which features are included in the model, and when and why. Quoting EdM:
In practice, the choice among correlated predictors in final models with either method is highly sample dependent, as can be checked by repeating these model-building processes on bootstrap samples of the same data. If there aren't too many predictors, and your primary interest is in prediction on new data sets, ridge regression, which tends to keep all predictors, may be a better choice.
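One way to see that sample dependence for yourself is to refit the lasso on bootstrap resamples and count how often each feature survives. This is only a rough stability check, and the resampling scheme and settings below are my own assumptions rather than EdM's procedure:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Data with a handful of informative features plus noise features.
X, y = make_regression(n_samples=200, n_features=15, n_informative=4,
                       noise=10.0, random_state=1)

rng = np.random.default_rng(1)
n_boot = 50
selected = np.zeros(X.shape[1])

for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))   # resample rows with replacement
    fit = LassoCV(cv=5).fit(X[idx], y[idx])
    selected += (fit.coef_ != 0)

# Fraction of bootstrap fits in which each coefficient stayed nonzero;
# unstable (correlated or weak) features will show intermediate fractions.
print(np.round(selected / n_boot, 2))
```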

