
Define the lasso estimate $$\hat\beta^\lambda = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2n} \|y - X \beta\|_2^2 + \lambda \|\beta\|_1,$$ where the $i^{th}$ row $x_i \in \mathbb{R}^p$ of the design matrix $X \in \mathbb{R}^{n \times p}$ is a vector of covariates for explaining the stochastic response $y_i$ (for $i = 1, \dots, n$).

We know that for $\lambda \geq \frac{1}{n} \|X^T y\|_\infty$, the lasso estimate $\hat\beta^\lambda = 0$. (See, for instance, Lasso and Ridge tuning parameter scope.) In other notation, this says that $\lambda_\mathrm{max} = \frac{1}{n} \|X^T y\|_\infty$. Notice that $\lambda_\mathrm{max} = \sup \{\lambda : \hat\beta^\lambda \ne 0\}.$ We can see this visually in the following image displaying the lasso solution path:
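As a quick sanity check of this threshold (a sketch only, on simulated data): scikit-learn's `Lasso` minimizes the same $\frac{1}{2n}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ objective when `fit_intercept=False`, so we can confirm that the solution is exactly zero at $\lambda_\mathrm{max}$ and active just below it.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# lambda_max = (1/n) ||X^T y||_inf
lam_max = np.max(np.abs(X.T @ y)) / n

# At lambda_max the lasso solution should be exactly zero;
# just below lambda_max at least one coefficient should be active.
fit_at = Lasso(alpha=lam_max, fit_intercept=False).fit(X, y)
fit_below = Lasso(alpha=0.99 * lam_max, fit_intercept=False).fit(X, y)

print(np.all(fit_at.coef_ == 0))      # expected: True
print(np.any(fit_below.coef_ != 0))   # expected: True
```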

[figure: lasso solution path]

Notice that on the far right-hand side of the plot, all of the coefficients are zero. This happens at the value $\lambda_\mathrm{max}$ described above.

From this plot, we also notice that on the far left side, all of the coefficients are nonzero: what is the value of $\lambda$ at which any component of $\hat\beta^\lambda$ first becomes zero? That is, what is $$\lambda_\mathrm{min} = \min_{\{\lambda \,:\, \exists j \textrm{ s.t. } \hat\beta^\lambda_j = 0\}} \lambda$$ equal to, as a function of $X$ and $y$? I'm interested in a closed-form solution. In particular, I'm not interested in an algorithmic answer, such as suggesting that LARS could find the knot through computation.

Despite my interests, it seems like $\lambda_\mathrm{min}$ may not be available in closed form, since otherwise lasso computational packages would likely take advantage of it when deciding how deep to go into the tuning-parameter grid during cross validation. In light of this, I'm interested in anything that can be shown theoretically about $\lambda_\mathrm{min}$, and (still) particularly interested in a closed form.
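To make the target concrete, here is a small numerical sketch (explicitly not the closed form I'm after) that approximates $\lambda_\mathrm{min}$ on a grid. It assumes simulated data and scikit-learn's `lasso_path`, which uses the same $\frac{1}{2n}$-scaled objective as above.

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Dense grid of tuning parameters from lambda_max down to (nearly) zero.
lam_max = np.max(np.abs(X.T @ y)) / n
grid = np.linspace(lam_max, 1e-4 * lam_max, 2000)

# coefs has shape (p, len(alphas)); column k is the lasso fit at alphas[k].
alphas, coefs, _ = lasso_path(X, y, alphas=grid)

# lambda_min is (approximately) the smallest grid value at which
# some coefficient is exactly zero.
has_zero = np.any(coefs == 0, axis=0)
print(alphas[has_zero].min())
```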

user795305
  • This is stated and proven in the glmnet paper: http://web.stanford.edu/~hastie/Papers/glmnet.pdf – Matthew Drury Jul 06 '17 at 04:59
  • @MatthewDrury Thanks for sharing this! However, this paper doesn't seem to show what you suggest it does. In particular, notice that my $\lambda_\max$ is their $\lambda_\min$. – user795305 Jul 06 '17 at 12:06
  • Are you sure we need [tuning-parameter] tag? – amoeba Jul 06 '17 at 12:47
  • Have a look at one of the first "algorithms" used to solve the lasso: least angle regression. http://statweb.stanford.edu/~tibs/ftp/lars.pdf (i.e. https://en.wikipedia.org/wiki/Least-angle_regression) LARS:Lasso does exactly what you are looking for: it produces the full piecewise path by calculating the steps/knots at which something happens (a variable enters or leaves). Unfortunately for you, it's built up in the wrong direction, starting from the sparsest solution, and you have to be careful: the scaling is different (compared with your objective function) but equivalent. – chRrr Jul 06 '17 at 13:19
  • @amoeba I thought it would be appropriate to include the tag since the question is focused on tuning parameters. Do you think that there's another, more appropriate tag? or that no tag should be used at all? (I don't mind deleting it if it doesn't fit well.) – user795305 Jul 06 '17 at 14:17
  • @chRrr Thanks for the suggestion! I did play around with LARS a bit, but, like you mention, it doesn't seem clear how to get that algorithm to furnish a closed form for the first knot. Maybe you were suggesting that it would be useful for numerical computation of this knot? In that regard, I agree that it's important to point out. Thank you! – user795305 Jul 06 '17 at 14:18
  • 1
    you a right, a closed form for the lasso solution does in general not exist (see https://stats.stackexchange.com/questions/174003/why-is-there-no-closed-form-lasso-solution). however, lars at least tells you whats going on and under which exact conditions/at which time you can add/delete a variable. i think something like this is the best you can get. – chRrr Jul 07 '17 at 08:14
  • 1
    @chRrr I'm not sure that's completely fair to say: we know that $\hat\beta^\lambda = 0$ for $\lambda \geq \frac{1}{n} |X^t y|\infty$. That is, in the extreme case of the solution being 0, we have a closed form. I'm asking if similar is true in the extreme case of the lasso estimate being dense (ie no zeros). Indeed, I'm not even interested in the exact entries of $\hat\beta\lambda$---just whether they're zero or not. – user795305 Jul 07 '17 at 12:24
  • Note to future readers: I've replaced all appearances of the notation $\lambda_\max$ with $\lambda_\min$ just now. – user795305 Jul 13 '17 at 16:16
  • Two comments. Firstly, if your variables are orthogonal to each other, then the lasso coefficient $\hat\beta_i^{\lambda}$ is the soft-thresholded least-squares coefficient $\mathrm{sign}(\hat\beta_i) \max(0, |\hat\beta_i| - \lambda)$. Which is a nice answer (imho) in that case. In the general case, when a lasso coefficient becomes non-zero is determined by the LARS algorithm. This is explained in ESL. – meh Jul 13 '17 at 17:23
  • @aginensky Thanks for the comment! You're right that these sorts of things become extremely trivial when the covariates are orthogonal. You're also right that the LARS algorithm is an interesting first thought about how to derive $\lambda_\min$, since the algorithm can be used to compute these knots. However, it doesn't seem clear how to carry out this derivation. I'd be very interested if you could provide one. – user795305 Jul 13 '17 at 17:26
  • @Ben If you look in ESL, in the same chapter that discusses lasso and ridge, LARS is discussed. LARS provides an algorithm for adding in variables a specific amount at a time. I think that Efron called it democratic ridge or something. In any event, as I recall, the points at which one switches which variable one is adding in LARS correspond to the $\lambda$'s of lasso. Have a look at the book. I don't understand your reference to knots. hth. – meh Jul 13 '17 at 17:54
  • @aginensky I'm familiar with LARS but, unfortunately, still stuck. As suggested by your comment, when $X^T X = I$ (so that the covariates are orthogonal and scaled), we see that $\lambda_\mathrm{min} = \frac{1}{n} \min_j |(X^T y)_j|$ and $\lambda_\mathrm{max} = \frac{1}{n} \max_j |(X^T y)_j|$. Since this expression of $\lambda_\mathrm{max}$ holds for arbitrary $X$ (as shown in the linked post in my question), I hoped that the expression for $\lambda_\mathrm{min}$ would too. However, by running a short simulation, it seems to (consistently) overestimate the true value of $\lambda_\mathrm{min}$. – user795305 Jul 13 '17 at 19:22
  • Isn't the backwards calculation of the last LASSO step, in the LARS algorithm, a closed-form solution that finds you the lowest $\lambda$ with a zero coefficient (either just activated, or crossing zero)? – Sextus Empiricus Oct 05 '17 at 04:30
  • The last step is in the direction of the normal of the surface of constant $\|\beta\|_1$, which can be found by taking the gradient of this surface (or a slightly different path in the code that I used). Then calculate back until one component becomes zero, and evaluate the ratio of the change of the SSE and $\|\beta\|_1$. In steps: 1) $$v = (X^T X)^{-1} \operatorname{sign}(\hat{\beta})$$ 1b) normalize $$v_n = \frac{v}{\sum v \cdot \operatorname{sign}(\hat{\beta})}$$ 2) $$d = \min(\beta/v > 0)$$ 3) $$\lambda = d \, \|X v\|_2^2$$ – Sextus Empiricus Oct 11 '17 at 00:18
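For completeness, here is a rough numerical sketch of the back-calculation described in the last comment, translated to the $\frac{1}{2n}$ scaling used in the question. It assumes simulated data, $p \le n$ with $X^T X$ invertible, and that the final segment of the lasso path (for small $\lambda$) carries the sign pattern of the least-squares solution, in which case the KKT conditions give $\hat\beta^\lambda = \hat\beta^{\mathrm{LS}} - n\lambda\,(X^T X)^{-1} \operatorname{sign}(\hat\beta^{\mathrm{LS}})$ on that segment.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.standard_normal((n, p))
y = X @ rng.standard_normal(p) + rng.standard_normal(n)

# Least-squares solution and the direction of the final path segment.
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
v = np.linalg.solve(X.T @ X, np.sign(beta_ls))

# On that segment beta(lambda) = beta_ls - n * lambda * v, so component j
# hits zero at lambda = beta_ls[j] / (n * v[j]); the smallest positive
# such value is the candidate lambda_min.
ratios = beta_ls / (n * v)
lam_min = ratios[ratios > 0].min()
print("back-calculated lambda_min:", lam_min)

# Sanity check with scikit-learn's Lasso (same 1/(2n)-scaled objective):
# just below lambda_min all coefficients should be active, just above it
# at least one should be exactly zero.
below = Lasso(alpha=0.9 * lam_min, fit_intercept=False).fit(X, y)
above = Lasso(alpha=1.1 * lam_min, fit_intercept=False).fit(X, y)
print(np.all(below.coef_ != 0))   # expected: True
print(np.any(above.coef_ == 0))   # expected: True
```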