-3

What are some explanations for why LASSO is considered good at selection, whereas it is less good at estimation?*

In other words, what are the pros and cons of LASSO?


* An example where something like this is stated, and where LASSO is portrayed as only good for selection, is the question "Using LASSO only for feature selection", which mentions:

"…the benefit of $l_1$ regularization for performing feature selection, but also the benefit of $l_2$ regularization for reducing overfitting…"

  • 3
    who says it is "so bad on estimation"? – Peter Ellis Apr 25 '13 at 23:50
  • 1
    There is a lot of literature on this topic (while I don't fully understand the question). Please formulate your question so that it is answerable within a reasonable timeframe. What problem are you facing? What steps have you taken to solve it, and which particular issue do you not understand? As it stands now, this question is not answerable. – sashkello Apr 26 '13 at 00:25

4 Answers

2

What do you mean by 'bad in estimation' here? What are you comparing against?

When doing estimation in the presence of model selection, with most forms of selection your nonzero parameter estimates are biased away from zero. Some shrinkage would seem to be not only beneficial in reducing that bias, but prudent.

If you think in terms of both bias and variance (or some other, perhaps more robust measure of scale; let me say 'variability' in a more general sense) of your predictions, your variability will tend to be smaller if your parameter estimates are somewhat biased toward zero. There's a tradeoff between the amount of downward bias and the variability of predictions.
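As a rough illustration of this trade-off (a minimal sketch, not part of the original answer; it assumes scikit-learn and NumPy, and the simulation settings are arbitrary), one can repeatedly simulate a sparse regression and decompose the error of the LASSO coefficient estimates into squared bias and variance for different amounts of shrinkage:

```python
# Sketch: squared bias and variance of LASSO coefficient estimates as the
# amount of shrinkage (alpha) grows. Assumes scikit-learn and NumPy;
# the simulation settings below are arbitrary illustrative choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
beta_true = np.concatenate([np.array([3.0, -2.0, 1.5]), np.zeros(p - 3)])

alphas = [0.01, 0.1, 0.5, 1.0]
n_sims = 200
estimates = {a: [] for a in alphas}

for _ in range(n_sims):
    X = rng.normal(size=(n, p))
    y = X @ beta_true + rng.normal(scale=2.0, size=n)
    for a in alphas:
        estimates[a].append(Lasso(alpha=a).fit(X, y).coef_)

for a in alphas:
    est = np.array(estimates[a])                       # shape (n_sims, p)
    bias_sq = ((est.mean(axis=0) - beta_true) ** 2).sum()
    variance = est.var(axis=0).sum()
    print(f"alpha={a:5.2f}  squared bias={bias_sq:7.3f}  "
          f"variance={variance:7.3f}  total={bias_sq + variance:7.3f}")
```

Typically the squared bias grows and the variance shrinks as alpha increases, and their sum is smallest at some intermediate amount of shrinkage, which is exactly the trade-off described above.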

Glen_b
  • First, I am sorry for my English. I think (perhaps it is not correct) that there is a difference between the SELECTION step and the ESTIMATION step. In other words, LASSO will give us the important variables, but when we try to calculate the error between the estimator $\hat{\beta}$ and the true $\beta$, I imagine that we have lost some variables that could "ameliorate" (improve) the estimation. Sorry again, maybe I didn't explain it well. By the way, THANK YOU SO MUCH for your responses. Sincerely, Layth – Layth TunoPariso Apr 26 '13 at 09:40
  • "when we try to calculate the error between the estimator $\hat\beta$ and the true $β$" -- as with predictions, there's two competing effects going on, one helping you and one making things worse; variance is generally improved, bias is made worse. A little shrinkage may be a good thing. A lot of shrinkage may be too much. – Glen_b Apr 27 '13 at 00:57
2

This relates to the question How does an ideal prior distribution need a probability mass at zero to reduce variance, and fat tails to reduce bias? We can regard regularisation such as LASSO and ridge regression as placing a prior on the magnitude of the parameters (as in the image below, with Laplace/Gaussian priors for LASSO/ridge regression respectively).

These priors can have two different effects.

  1. Reduce parameter estimates all the way down to zero. This is the topic of the linked question, where priors with a probability mass at zero are considered. Such priors place extra focus on reducing parameter estimates to zero, which is useful for parameter/regressor selection when we believe/assume that most parameters should be zero.

  2. Shrink parameter estimates to smaller values. This helps to reduce variance and overfitting, in a similar way to shrinkage estimators, in situations where we do not believe that many parameters should be zero.

Compared to ridge regression, LASSO places more of the prior's weight close to zero, which makes it more extreme at the first task of selecting parameters/regressors. At the same time it also has fat tails, which cause the non-zero parameter estimates to be less heavily regularised. This weaker shrinkage can be considered an advantage or a disadvantage, depending on the problem.

When we want to shrink parameters rather than select them (to reduce variance and overfitting to noise), ridge regression places more focus on that task than LASSO does and can be considered better.

[Image: lasso and ridge regression priors] Comparison of priors for LASSO (Laplace-distributed prior) and ridge regression (Gaussian-distributed prior). The scales change when the regularisation parameters are changed, but the shapes stay the same. The Laplace distribution (for LASSO) is more sharply peaked near zero but has longer tails far from zero. The Gaussian distribution (for ridge regression) has a blunter peak at zero and lighter tails, leading to smaller parameter estimates, but not necessarily estimates close to zero.
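As a quick numerical check of this prior interpretation (a minimal sketch, not part of the original answer; the scale parameters are arbitrary), the negative log-density of a Laplace prior is linear in $|w|$, i.e. an L1 penalty, while that of a Gaussian prior is quadratic in $w$, i.e. an L2 penalty:

```python
# Sketch: the penalties implied by Laplace and Gaussian priors on a coefficient w.
# The scale parameters b and sigma are arbitrary choices for illustration.
import numpy as np

w = np.linspace(-3.0, 3.0, 7)
b, sigma = 1.0, 1.0

# Negative log-densities, dropping additive constants that do not depend on w.
laplace_penalty = np.abs(w) / b                 # proportional to an L1 penalty (LASSO)
gaussian_penalty = w ** 2 / (2 * sigma ** 2)    # proportional to an L2 penalty (ridge)

for wi, l1, l2 in zip(w, laplace_penalty, gaussian_penalty):
    print(f"w={wi:5.2f}   Laplace/L1 penalty={l1:5.2f}   Gaussian/L2 penalty={l2:5.2f}")
```

Near zero the Laplace penalty keeps a constant slope, which is what can push small coefficients exactly to zero, while far from zero it grows only linearly, so large coefficients are penalised less heavily than under the quadratic Gaussian penalty.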

1

One of the key limitations of LASSO is that, because of the L1 norm, when predictors are correlated LASSO tends to select only one of them (more or less arbitrarily) and shrink the betas of the others to 0. That is why using elastic net or ridge regression helps keep correlated predictors in the final model. Given this limitation of LASSO, it is always a good idea to run cross-validation multiple times to get relatively stable estimates.
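As a rough illustration of this behaviour (a minimal sketch, not part of the original answer; it assumes scikit-learn, and the simulated data and penalty settings are arbitrary), a LASSO fit will typically keep only one of two nearly identical predictors, while an elastic net spreads the coefficient across both:

```python
# Sketch: LASSO vs. elastic net when two predictors are almost perfectly correlated.
# Assumes scikit-learn; the correlation level and penalty settings are arbitrary.
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)   # almost a copy of x1
x3 = rng.normal(size=n)               # an unrelated predictor
X = np.column_stack([x1, x2, x3])
y = x1 + x2 + rng.normal(scale=0.5, size=n)

print("LASSO coefficients:      ", Lasso(alpha=0.1).fit(X, y).coef_)
print("Elastic net coefficients:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)
```

In runs of this kind of simulation, LASSO typically puts almost all of the weight on one of the two correlated columns (and which one it picks can change with small perturbations of the data), while the elastic net splits the weight more evenly between them.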

wusixer
0

Lasso simply adds an L1 regularization term to a regression. The L1 term acts as a constraint that does not allow $w$ in the objective $||Aw-B||_2^2+\lambda||w||_1$ to grow: since the penalty is the sum of the absolute values of the elements of $w$, decreasing it pushes every element of $w$ downwards, and the smallest possible magnitude of an element is 0. Elements of $w$ therefore tend to be set exactly to 0, as long as the whole objective stays small (so there is a trade-off). The result is a $w$ that keeps $||Aw-B||_2$ small (though not as small as it could be without the penalty) while most of its elements are zero. This is the sparse solution for $||Aw-B||_2$. Lasso is useful in problems where we know that not all the features of $A$ are involved in constructing $B$, but we do not know which ones: the elements of $w$ that are not zero correspond to the features that are important for constructing $B$.
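As a small illustration of this sparsity effect (a minimal sketch, not part of the original answer; it assumes scikit-learn, whose Lasso minimises this objective up to a $1/(2n)$ factor on the squared-error term, and the data and penalty weights are made up), increasing the weight on $||w||_1$ drives more elements of $w$ to exactly zero:

```python
# Sketch: how the weight on ||w||_1 controls the sparsity of w in
# min ||Aw - B||_2^2 + lambda * ||w||_1. Assumes scikit-learn, whose Lasso
# uses this objective with a 1/(2n) factor on the squared-error term.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 50, 10
A = rng.normal(size=(n, p))
w_true = np.array([2.0, 0, 0, -1.5, 0, 0, 0, 0.5, 0, 0])   # only 3 features matter
B = A @ w_true + 0.1 * rng.normal(size=n)

for lam in [0.01, 0.1, 0.5, 1.0]:
    w_hat = Lasso(alpha=lam).fit(A, B).coef_
    print(f"lambda={lam:5.2f}  non-zero elements of w: {int(np.sum(w_hat != 0))}")
```

The larger the penalty weight, the fewer non-zero elements survive; the elements that remain non-zero point to the columns of $A$ that matter most for constructing $B$.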

shadi