Questions tagged [lasso]

A regularization method for regression models that shrinks coefficients towards zero, setting some of them exactly equal to zero; the lasso thus performs feature selection.

LASSO is an acronym for least absolute shrinkage and selection operator. It is a form of regularization used in the estimation of regression coefficients that shrinks the estimates by penalizing their absolute values (i.e. the $L_1$ norm of the coefficient vector). Some coefficients may be shrunk exactly to zero; thus the lasso performs feature selection. The lasso solution is also the MAP (maximum a posteriori) estimate in the Bayesian regression problem where an i.i.d. Laplace (double-exponential) prior is placed on the regression parameters.
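A minimal sketch of this selection effect, using scikit-learn (the synthetic data and penalty value below are illustrative assumptions, not part of the method's definition):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))  # only 3 features matter
y = X @ beta_true + rng.standard_normal(n)

fit = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
print(fit.coef_)                   # most entries are exactly 0.0
print(np.flatnonzero(fit.coef_))   # indices of the selected features
```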

In the context of linear regression, we can formulate the LASSO problem as:

Given a set of training data $(x_1,y_1),\dots,(x_n,y_n)$ where $x_i \in \mathbb{R}^{p}$, we seek a vector of coefficients $\hat{\beta}_{LASSO} \in \mathbb{R}^{p}$ satisfying:

$$\hat{\beta}_{LASSO} = \underset{\beta} {\text{argmin}} \sum\limits_{i=1}^{n}\Big(y_i - \sum\limits_{j=1}^{p}x_{i,j}\beta_{j}\Big)^2$$

$$ \text{subject to } \sum\limits_{j=1}^{p}|\beta_{j}| \leq t$$

Due to the nature of the $L_1$ penalty, there is in general no closed-form solution for $\hat{\beta}_{LASSO}$; computing the lasso estimate is a quadratic programming problem, unlike ridge regression, where a closed-form solution exists. In practice the estimate is computed numerically, e.g. by quadratic programming, LARS, or coordinate descent.
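A bare-bones cyclic coordinate-descent sketch for the Lagrangian form $\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ (a toy illustration of one common numerical approach, not a production solver):

```python
import numpy as np

def soft_threshold(z, gamma):
    """sgn(z) * max(|z| - gamma, 0), the key operator for the L1 penalty."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Cyclic coordinate descent for (1/2)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # ||x_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j removed from the current fit.
            r_j = y - X @ beta + X[:, j] * beta[j]
            # The univariate problem in beta_j does have a closed form:
            beta[j] = soft_threshold(X[:, j] @ r_j, lam) / col_sq[j]
    return beta
```

While the full problem lacks a closed form, each one-dimensional coordinate update has one: the soft-thresholding operator.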

In a Bayesian context, we can derive the same penalty by finding the $\beta$ that maximizes the posterior, i.e. the MAP estimate:

Assume a Laplacian prior on $\beta$:

$$\pi(\beta|\tau) \propto e^{-\frac{1}{2\tau}\sum_{j=1}^{p} |\beta_j|}$$

If we assume that $y \sim N(X\beta,\sigma^2 I)$, then the posterior of $\beta$ is:

$$P(\beta|X,Y,\sigma^2,\tau) \propto \prod_{i=1}^{n}e^{-\frac{1}{2\sigma^2}(y_i - x_i^T\beta)^2}e^{-\frac{1}{2\tau}\sum_{j=1}^{p} |\beta_j|} $$

Maximizing this posterior is equivalent to minimizing twice its negative logarithm:

$$ -2\log P(\beta|X,Y,\sigma^2,\tau) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - x_i^T \beta)^2 + \frac{1}{\tau}\sum_{j=1}^{p} |\beta_j| + \text{const} $$

Multiplying through by $\sigma^2$ and setting $\lambda = \frac{\sigma^2}{\tau}$, the MAP problem becomes:

$$\hat{\beta} = \underset{\beta} {\text{argmin}} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda\sum_{j=1}^{p} |\beta_j| = \hat{\beta}_{LASSO}$$
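A quick numerical sanity check of this equivalence: twice the negative log-posterior and the lasso objective differ only by the constant factor $\sigma^2$, so they share the same minimizer. (The data and the values of $\sigma^2$, $\tau$ below are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
sigma2, tau = 2.0, 0.5
lam = sigma2 / tau                      # the induced lasso penalty

def neg2_log_posterior(beta):           # dropping additive constants
    return ((y - X @ beta) ** 2).sum() / sigma2 + np.abs(beta).sum() / tau

def lasso_objective(beta):
    return ((y - X @ beta) ** 2).sum() + lam * np.abs(beta).sum()

beta = rng.standard_normal(p)
# Identical up to the constant factor sigma^2 at any beta:
print(np.isclose(sigma2 * neg2_log_posterior(beta), lasso_objective(beta)))  # True
```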

1472 questions
72 votes, 2 answers

Derivation of the closed-form lasso solution

For the lasso problem $\min_\beta (Y-X\beta)^T(Y-X\beta)$ subject to $\|\beta\|_1 \leq t$, I often see the soft-thresholding result $$ \beta_j^{\text{lasso}}= \mathrm{sgn}(\beta^{\text{LS}}_j)(|\beta_j^{\text{LS}}|-\gamma)^+ $$ for the orthonormal…
Gary
  • 1,601
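(Aside: the soft-thresholding identity quoted in this question is easy to check numerically for an orthonormal design. The sketch below uses scikit-learn, whose lasso objective scales the squared loss by $\frac{1}{2n}$, so the threshold works out to $\gamma = n\alpha$; that scaling is specific to this parameterization.)

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p, alpha = 200, 8, 0.01
X, _ = np.linalg.qr(rng.standard_normal((n, p)))   # columns of X are orthonormal
y = X @ rng.normal(0.0, 3.0, p) + rng.standard_normal(n)

beta_ls = X.T @ y                                  # least-squares fit when X'X = I
fit = Lasso(alpha=alpha, fit_intercept=False, tol=1e-10, max_iter=100_000).fit(X, y)

gamma = n * alpha                                  # threshold under the 1/(2n) loss scaling
soft = np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - gamma, 0.0)
print(np.allclose(fit.coef_, soft, atol=1e-6))     # True
```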
13 votes, 2 answers

Can $\|\beta^*\|_2$ increase when $\lambda$ increases in Lasso?

If $\beta^*=\mathrm{arg\,min}_{\beta} \|y-X\beta\|^2_2+\lambda\|\beta\|_1$, can $\|\beta^*\|_2$ increase when $\lambda$ increases? I think this is possible. Although $\|\beta^*\|_1$ does not increase when $\lambda$ increases (my proof),…
Ziyuan
  • 1,746
13 votes, 1 answer

Connection between Lasso formulations

This question might be dumb, but I noticed that there are two different formulations of the Lasso regression. We know that the Lasso problem is to minimize the objective consisting of the square loss plus the $L_1$ penalty term, expressed as…
SixSigma
  • 2,292
12 votes, 2 answers

Lasso modification for LARS

I am trying to understand how the LARS algorithm can be modified to generate the Lasso. While I do understand LARS, I am not able to see the Lasso modification in the paper by Tibshirani et al. In particular, I don't see why the sign condition…
11 votes, 3 answers

How defensible is it to choose $\lambda$ in a LASSO model so that it yields the number of nonzero predictors one desires?

When I determine my lambda through cross-validation, all coefficients become zero. But I have some hints from the literature that some of the predictors should definitely affect the outcome. Is it rubbish to arbitrarily choose lambda so that there…
miura
  • 3,684
11 votes, 2 answers

1/2 in the Lagrangian equation for the lasso

I've read the fantastic book The Elements of Statistical Learning and I have a question about the Lasso problem in its Lagrangian form: $\hat{\beta}_{lasso} = \text{argmin} \{ \frac{1}{2} \sum_{i=1}^{N}(y_i -\beta_0 -\sum_{j=1}^{p}…
ancamar
  • 111
11 votes, 2 answers

Coordinate descent soft-thresholding update operator for LASSO

I was reading this paper (Friedman et al., 2010, Regularization Paths for Generalized Linear Models via Coordinate Descent) describing the coordinate descent algorithm for LASSO, and I can't quite figure out how the soft-thresholding update for each…
aenima
  • 353
9 votes, 1 answer

Expressing the LASSO regression constraint via the penalty parameter

Given the two equivalent formulations of the problem for LASSO regression, $\min(RSS + \lambda\sum|\beta_i|)$ and $\min(RSS)$ such that $\sum|\beta_i|\leq t$, how can we express the one-to-one correspondence between $\lambda$ and $t$?
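(Aside: one way to see the correspondence concretely is that for each $\lambda$, the implied constraint level is $t(\lambda) = \|\hat\beta(\lambda)\|_1$, which is non-increasing in $\lambda$. A small sketch tracing that map on synthetic data; the scikit-learn parameterization used here is an assumption.)

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.standard_normal(100)

# t(lambda) = ||beta_hat(lambda)||_1 maps each penalty to its constraint level.
for lam in (0.01, 0.05, 0.1, 0.5, 1.0):
    beta = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
    print(f"lambda = {lam:4.2f}  ->  t = {np.abs(beta).sum():.4f}")
```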
8 votes, 1 answer

Why isn't the Dantzig selector popular in applied statistics?

Lasso-like methods have become pretty common in applied statistics but the Dantzig selector remains unpopular despite having great properties (minimax optimality). Why hasn't it become more popular?
6 votes, 1 answer

Bias of Tibshirani's Lasso estimator

I am searching for a theorem that gives upper bounds for the bias of the Lasso estimator from Tibshirani [1]. Does anybody know such a theorem? [1] Tibshirani, R. (1996). “Regression Shrinkage and Selection via the Lasso”, Journal of the Royal…
Markus
  • 63
5 votes, 2 answers

How is the $\lambda$ tuning parameter in lasso logistic regression generated?

I know glmnet(x,y) generates a sequence of $\lambda$ values, but I am very curious to know the actual formula behind generating $\lambda$.
bison2178
  • 487
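(Aside: for the Gaussian case the construction is driven by the KKT conditions: $\hat\beta = 0$ is optimal exactly when $\lambda \geq \lambda_{\max} = \max_j |x_j^T y|/n$ for centered, standardized inputs under the $\frac{1}{2n}$ loss scaling, and the grid is log-spaced downward from $\lambda_{\max}$. A sketch of that construction; the ratio 1e-4 mirrors glmnet's reported default when $n > p$, but treat the exact defaults as an assumption.)

```python
import numpy as np

def lambda_grid(X, y, n_lambda=100, min_ratio=1e-4):
    """Log-spaced lambda sequence for (1/(2n))||y - X b||^2 + lam * ||b||_1,
    assuming centered y and centered, standardized columns of X.
    lambda_max is the smallest lambda at which beta_hat is all-zero (KKT)."""
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n
    return np.geomspace(lam_max, min_ratio * lam_max, n_lambda)
```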
5 votes, 1 answer

LASSO and compatibility constant

I am new to this website and, coming from the field of economics (although interested in high-dimensional statistics), I am reading Statistics for High-Dimensional Data by Bühlmann and van de Geer. I struggle to get an intuition of what the…
A.Barra
  • 51
5 votes, 1 answer

LASSO with two predictors

I have a question regarding LASSO with two predictors, somewhat related to another one of mine posted here. I am trying to illustrate equation (6) of the original paper by Tibshirani, JRSSB 1996, which says that the LASSO estimates…
4 votes, 0 answers

lasso fails with few large effects and many small effects

I have an application that predicts height based on sex and DNA mutations. As height differs greatly by sex, sex is a variable with a strong effect in my prediction (~13). In contrast, I have many DNA mutations (say 500,000…
F. Privé
  • 231
4 votes, 0 answers

What are the differences between LASSO and SALSA?

When I look at their formulations, they appear to be the same. In SALSA, the formulation is: \begin{equation*} \min_{x} \phi(x) \text{ subject to } \frac{1}{2} \|Ax-y\|_F \leq \epsilon \end{equation*} This constrained problem is then transformed into an…