
In the Lasso and the Elastic Net, the penalty uses the $\ell_1$ norm without squaring, whereas in the Elastic Net and Ridge the $\ell_2$ norm is squared. Why is that? Is there a particular reason (computational, optimization dynamics, statistical)? From a purely constrained/penalized convex optimization perspective, it looks like the squared and unsquared $\ell_1$ norms would solve essentially the same problem (since $\|x\|_1<D \iff \|x\|_1^2<D^2$), so I would guess that choosing squared $\ell_1$ versus plain $\ell_1$ only affects the training dynamics. Is that correct? Or could it be that for non-smooth problems (like $\ell_1$) one wants Lipschitz continuity of the penalty (which holds for the unsquared $\ell_1$ but not for its square), since smoothness is not available? Curious about this (I would be glad to read references).
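
A minimal numerical sketch of the claim in the question (my addition, not part of the original post), assuming only NumPy. The idea: if $\hat\beta$ minimizes $\mathrm{RSS}(\beta) + \lambda\|\beta\|_1$, then matching the stationarity conditions (a subgradient of $\mu\|\beta\|_1^2$ is $2\mu\|\beta\|_1$ times a subgradient of $\|\beta\|_1$) shows that the same $\hat\beta$ minimizes $\mathrm{RSS}(\beta) + \mu\|\beta\|_1^2$ with $\mu = \lambda/(2\|\hat\beta\|_1)$, provided $\hat\beta \neq 0$. So the two penalties trace out the same solution path; only the labelling of the tuning constant differs. The brute-force grid search below checks this on a tiny two-predictor problem.

```python
# Sketch only: brute-force check that the plain and squared l1 penalties
# pick out the same minimizer once the tuning constants are matched.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.normal(size=(n, 2))
y = X @ np.array([1.5, 0.0]) + 0.3 * rng.normal(size=n)

# evaluate RSS and the l1 norm on a 2-D grid of coefficient values
grid = np.linspace(-2.0, 2.0, 201)
B1, B2 = np.meshgrid(grid, grid, indexing="ij")
rss = ((y[:, None, None]
        - X[:, 0, None, None] * B1
        - X[:, 1, None, None] * B2) ** 2).sum(axis=0)
l1 = np.abs(B1) + np.abs(B2)

def argmin_grid(objective):
    """Coordinates of the grid point minimizing the objective surface."""
    i, j = np.unravel_index(np.argmin(objective), objective.shape)
    return np.array([grid[i], grid[j]])

lam = 5.0
beta_plain = argmin_grid(rss + lam * l1)        # penalty  lam * ||b||_1
mu = lam / (2.0 * np.abs(beta_plain).sum())     # matched tuning constant (needs beta_plain != 0)
beta_squared = argmin_grid(rss + mu * l1 ** 2)  # penalty  mu * ||b||_1^2

print("plain-l1 minimizer  :", beta_plain)
print("squared-l1 minimizer:", beta_squared)    # same point, up to the grid resolution
```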

  • if you don't square it, you get a nondifferentiable penalty, which is generally a bummer since it makes optimization more complicated. But on the other hand, the nondifferentiability means it can induce exact zeros in optimization, which is exploited by, for example, the Group Lasso. – John Madden Nov 28 '22 at 22:22
  • @JohnMadden "But on the other hand, the nondifferentiability means it can induce exact zeros in optimization, which is exploited by, for example, the Group Lasso." It should give the same solutions: $\|x\|_1<D \iff \|x\|_1^2<D^2$. – Sextus Empiricus Nov 28 '22 at 22:29
  • [Ah, I should have been more clear: I was referring to the $\ell_2$ norm, not $\ell_1$.] Yes, but $\operatorname{argmin}\, L(x) + \Vert x \Vert_2 \neq \operatorname{argmin}\, L(x) + \Vert x \Vert_2^2$. So if we were just optimizing the penalty it would be the same, but used as a regularization term we get different behavior (a tiny one-dimensional check of this appears after these comments). – John Madden Nov 28 '22 at 22:47
  • @JohnMadden for a specific regularisation coefficient the result is different. But when we vary the coefficients then the paths of all the solutions will be the same. – Sextus Empiricus Nov 29 '22 at 06:52
  • @SextusEmpiricus if this were true, then lasso would be equivalent to ridge when there is only one covariate. – John Madden Nov 29 '22 at 21:44
  • @JohnMadden why would it not be equivalent with only one covariate? – Sextus Empiricus Nov 29 '22 at 21:50
  • @SextusEmpiricus the easiest way to see this is that Lasso will threshold the regression coefficient to zero for a sufficiently large penalty coefficient, while ridge will not for any finite coefficient (unless the OLS estimator is zero too); see the one-covariate sketch after these comments. – John Madden Nov 29 '22 at 21:52
  • @JohnMadden but still the same solutions are available. The only exception is the point where all coefficients are zero. That point is not exploited by LARS/LASSO, nor by the group lasso if the square is taken over the sum of all the groups (i.e. $(\lambda_1 \beta_1 + \lambda_1 \beta_2 + \lambda_2 \beta_3 + \lambda_2 \beta_4)^2$); but if you take the squares per group, $\lambda_1(\beta_1 + \beta_2)^2 + \lambda_2 (\beta_3 + \beta_4)^2$, then it would be equivalent to some sort of mix between lasso and ridge regression. (Also, it is not non-differentiability that creates this zero point.) – Sextus Empiricus Nov 30 '22 at 06:25
  • @SextusEmpiricus I'm very interested in your comment that it's not nondifferentiability that creates the zero point; can you point me to any references for that? When I said that this was the case, I had section 2 of Fan and Li (2001) in mind: https://fan.princeton.edu/papers/01/penlike.pdf (in particular, "From the foregoing discussion, we conclude that a penalty function satisfying the conditions of sparsity and continuity must be singular at the origin"). – John Madden Nov 30 '22 at 15:54
  • @JohnMadden - In normal ridge regression the penalty $\|\beta\|_2^2$ has zero slope (in all directions) at the point $\beta=0$. So for any finite $\lambda$ the solution cannot be $\beta = 0$ (since an infinitesimal step can be made that reduces the sum of squared residuals without increasing the penalty). – Sextus Empiricus Nov 30 '22 at 16:44
  • For LASSO this is not the same, since the penalty is $\|\beta\|_1$ (not quadratic with zero slope). Because of that, LASSO has some limiting value $\lambda_{\text{lim}}$ above which all the solutions are zero, because the penalty term (multiplied by $\lambda$) increases more than the residual sum of squares decreases. – Sextus Empiricus Nov 30 '22 at 16:44
  • Quoted from https://stats.stackexchange.com/a/340385/: so the reason there is a limiting $\lambda$ for the Lasso, above which all coefficients are zero, is that the penalty has a non-zero directional derivative at the point zero; for ridge regression that derivative is zero. (See also "LASSO: Deriving the smallest lambda at which all coefficient are zero".) – Sextus Empiricus Nov 30 '22 at 16:47
  • I have to go through that article, but it might be that their 'singular' refers to the derivative of the penalty, seen as a one-dimensional cost function, which is undefined at the point zero. But the directional derivatives still exist. – Sextus Empiricus Nov 30 '22 at 16:53
  • @SextusEmpiricus so it seems like what we've settled on is "nondifferentiability is not the only way to understand sparse optima" – John Madden Dec 01 '22 at 19:21
    @JohnMadden the zero point at a specific limiting $\lambda$ is different from the sparsity that you get with LARS/LASSO. That sparsity does not depend on whether or not you square the penalty term; $\|\beta\|_1^2$ and $\|\beta\|_1$ both give sparse solutions. – Sextus Empiricus Dec 01 '22 at 19:54
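
The one-dimensional check referred to in the comments above (my addition, with the toy loss $L(x) = (x-1)^2$ as a stand-in for a generic convex loss). In one dimension $\|x\|_2 = \|x\|_1 = |x|$, so it covers both norms: for a fixed tuning constant the plain and squared penalties give different minimizers (John Madden's point), yet every solution on one path also appears on the other at a different tuning value (Sextus Empiricus's point), with the single exception of exactly zero, which the squared penalty never reaches at any finite tuning value.

```python
# One-dimensional toy problem with loss L(x) = (x - 1)^2 (unpenalized minimizer x = 1).
def argmin_plain(lam):
    """argmin_x (x - 1)^2 + lam * |x|   -- soft thresholding."""
    return max(1.0 - lam / 2.0, 0.0)

def argmin_squared(lam):
    """argmin_x (x - 1)^2 + lam * x^2   -- proportional shrinkage."""
    return 1.0 / (1.0 + lam)

lam = 0.5
print(argmin_plain(lam), argmin_squared(lam))  # 0.75 vs 0.666...: different at the same lam

# ...but 0.75 also lies on the squared-penalty path, at lam' = 1/3:
print(argmin_squared(1.0 / 3.0))               # 0.75

# the one point the squared path never reaches at finite lam is exactly zero:
print(argmin_plain(3.0), argmin_squared(3.0))  # 0.0 vs 0.25
```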
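
The thresholding point raised in the later comments (the one-covariate sketch referred to above) can also be seen with scikit-learn, assuming it is installed. With no intercept, sklearn's Lasso minimizes $\frac{1}{2n}\|y - X\beta\|_2^2 + \alpha\|\beta\|_1$, so the single coefficient is exactly zero once $\alpha \ge |x^\top y|/n$, whereas Ridge, with its squared $\ell_2$ penalty, only shrinks the coefficient and never reaches zero for any finite $\alpha$. Note also that with a single covariate $\|\beta\|_1^2 = \beta^2$, so the squared-$\ell_1$ penalty coincides with the ridge penalty there, which is exactly the "only the zero point differs" exception discussed in the comments.

```python
# Sketch with scikit-learn: one covariate, no intercept.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 1))
y = 2.0 * x[:, 0] + rng.normal(size=100)

alpha_max = np.abs(x[:, 0] @ y) / len(y)   # smallest alpha at which the Lasso coefficient is zero
for a in [0.5 * alpha_max, 1.1 * alpha_max, 10 * alpha_max]:
    b_lasso = Lasso(alpha=a, fit_intercept=False).fit(x, y).coef_[0]
    b_ridge = Ridge(alpha=a, fit_intercept=False).fit(x, y).coef_[0]
    # Ridge's alpha is on a different scale than Lasso's, but that does not matter here:
    # the point is only that the ridge coefficient is never exactly zero for finite alpha.
    print(f"alpha = {a:8.3f}   lasso = {b_lasso: .6f}   ridge = {b_ridge: .6f}")
```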