In the Lasso and ElasticNet we use, as the penalty, the l1 norm without squaring. But in ElasticNet and Ridge, we use the l2 norm squared. Why is that? Is there a particular reason (computational, optimization dynamics, statistical)? From a purely constrained/penalized convex optimization perspective, it looks like both (the squared or the non-squared l1 norm) would solve a similar problem (since $\|x\|_1<D \iff \|x\|_1^2<D^2$), so I guess choosing either the squared or the plain l1 norm would only impact the training dynamics (is that correct)? Could it then be that, for non-smooth problems (like with $\ell_1$), one may want Lipschitz continuity (which holds for the non-squared l1 norm but not for the squared one), since we do not have smoothness? Curious about this (I would be glad to read references).
-
if you don't square it, you get a nondifferentiable penalty, which is generally a bummer since it makes optimization more complicated. But on the other hand, the nondifferentiability means it can induce exact zeros in optimization, which is exploited by, for example, the Group Lasso. – John Madden Nov 28 '22 at 22:22
-
@JohnMadden "But on the other hand, the nondifferentiability means it can induce exact zeros in optimization, which is exploited by, for example, the Group Lasso." It should give the same solutions. $|x|_1<D \iff |x|_1^2<D^2$ – Sextus Empiricus Nov 28 '22 at 22:29
-
[Ah, I should have been more clear: I was referring to the $\ell_2$ norm, not $\ell_1$]. Yes, but $\operatorname{argmin}\, L(x) + \Vert \mathbf{x}\Vert_2 \neq \operatorname{argmin}\, L(x) + \Vert \mathbf{x}\Vert_2^2$. So if we were just optimizing the penalty it would be the same, but used as a regularization term, we get different behavior. – John Madden Nov 28 '22 at 22:47
-
@JohnMadden for a specific regularisation coefficient the result is different. But when we vary the regularisation coefficient, the paths of all the solutions will be the same. – Sextus Empiricus Nov 29 '22 at 06:52
-
@SextusEmpiricus if this were true, then lasso would be equivalent to ridge when there is only one covariate. – John Madden Nov 29 '22 at 21:44
-
@JohnMadden why would it not be equivalent with only one covariate? – Sextus Empiricus Nov 29 '22 at 21:50
-
@SextusEmpiricus the easiest way to see this is that Lasso will threshold the regression coefficient to zero for sufficiently large penalty coefficient while ridge will not for any finite coefficient (unless the OLS estimator is zero too). – John Madden Nov 29 '22 at 21:52
-
@JohnMadden but still the same solutions are available. The only exception is the point where all coefficients are zero. But that is not exploited with LARS/LASSO, nor in group lasso if the square is taken over the sum across all the groups (i.e. $(\lambda_1 \beta_1 + \lambda_1 \beta_2 + \lambda_2 \beta_3 + \lambda_2 \beta_4)^2$); but if you take the squares per group, $\lambda_1(\beta_1 + \beta_2)^2 + \lambda_2 (\beta_3 + \beta_4)^2$, then it would be equivalent to some sort of mix between lasso and ridge regression. (Also, it is not non-differentiability that creates this zero point.) – Sextus Empiricus Nov 30 '22 at 06:25
-
@SextusEmpiricus I'm very interested in your comment that it's not nondifferentiability that creates the zero point, can you point me to any references for that? When I said that this was the case, I had section 2 of Fan and Li 2001 in mind: https://fan.princeton.edu/papers/01/penlike.pdf (in particular, "From the foregoing discussion, we conclude that a penalty function satisfying the conditions of sparsity and continuity must be singular at the origin"). – John Madden Nov 30 '22 at 15:54
-
@JohnMadden - In normal ridge regression you have a zero slope (in all directions) for $|\beta|^2$ at the point $\beta=0$. So for all finite $\lambda$ the solution cannot be $\beta = 0$ (since an infinitesimal step can be made to reduce the sum of squared residuals without increasing the penalty). – Sextus Empiricus Nov 30 '22 at 16:44
-
- For LASSO this is not the same, since the penalty is $\lvert \beta \rvert_1$ (so it is not quadratic with zero slope). Because of that, LASSO will have some limiting value $\lambda_{lim}$ above which all the solutions are zero, because the penalty term (multiplied by $\lambda$) will increase more than the residual sum of squares decreases. – Sextus Empiricus Nov 30 '22 at 16:44
-
Quoted from https://stats.stackexchange.com/a/340385/. So the reason for a limiting $\lambda$ for Lasso, above which all coefficients are zero, is that the penalty has a non-zero directional derivative at the point zero. For ridge regression that derivative is zero. (See also "LASSO: Deriving the smallest lambda at which all coefficients are zero".) – Sextus Empiricus Nov 30 '22 at 16:47
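A quick numeric check of this limiting-$\lambda$ behaviour, as a sketch with made-up data (note that scikit-learn's Lasso scales the squared loss by $1/(2n)$, so for centered data the smallest all-zero alpha is $\max_j |X_j^\top y|/n$, while Ridge keeps small but non-zero coefficients for any finite penalty):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data; centered so no intercept is needed.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.5, 0.0, -2.0, 0.5]) + rng.normal(size=100)
X -= X.mean(axis=0)
y -= y.mean()

# scikit-learn's Lasso minimizes (1/(2n))*||y - X b||^2 + alpha*||b||_1,
# so the smallest alpha forcing all coefficients to zero is max|X^T y| / n.
n = len(y)
alpha_max = np.max(np.abs(X.T @ y)) / n

lasso = Lasso(alpha=1.01 * alpha_max, fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=1e6, fit_intercept=False).fit(X, y)  # 1e6 is an arbitrary large penalty

print(lasso.coef_)  # exactly zero for every coefficient
print(ridge.coef_)  # shrunk towards zero, but not exactly zero
```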
-
I have to go through that article, but it might be that their 'singular' refers to the penalty, seen as a one-dimensional cost function, having an undefined derivative at the point zero. But the directional derivative still exists. – Sextus Empiricus Nov 30 '22 at 16:53
-
@SextusEmpiricus so it seems like what we've settled on is "nondifferentiability is not the only way to understand sparse optima" – John Madden Dec 01 '22 at 19:21
-
@JohnMadden the zero point for a specific limiting $\lambda$ is different from the sparsity that you get with LARS/LASSO. That sparsity is not dependent on whether or not you take a square for the penalty term; $ |\beta|_1^2$ and $|\beta|_1$ both give sparse solutions. – Sextus Empiricus Dec 01 '22 at 19:54
3 Answers
But in ElasticNet and Ridge, we use the l2 norm squared. Why is that? Is there a particular reason (computational, optimization dynamics, statistical)?
A possible reason for the l2 norm being squared in ridge regression (or Tikhonov regularisation) is that it allows an easy expression for the solution of the problem $$\hat\beta = (\textbf{X}^T\textbf{X} + \lambda \textbf{I})^{-1} \textbf{X}^T y$$ where $\textbf{X}$ is the regressor matrix or design matrix, $\lambda$ the scaling parameter for the penalty, $\textbf{I}$ the identity matrix, $y$ the observations, and $\hat{\beta}$ the estimate of the coefficients.
That solution can be derived by taking the derivative of the cost function and setting it equal to zero
$$\nabla_{\hat\beta} \left[ (y - \textbf{X}\hat\beta )^T(y - \textbf{X}\hat\beta ) + \hat\beta^T \lambda \textbf{I} \hat\beta \right] = -2\textbf{X}^T \left (y - \textbf{X}\hat\beta \right)+ 2\lambda \textbf{I} \hat\beta = \textbf{0}$$
which rearranges to $(\textbf{X}^T\textbf{X} + \lambda \textbf{I})\hat\beta = \textbf{X}^T y$ and hence to the expression above.
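As a minimal sketch (with made-up data), the closed form can be cross-checked against scikit-learn's Ridge, which minimizes $\|y - \textbf{X}\beta\|_2^2 + \alpha\|\beta\|_2^2$:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Made-up regression problem.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=50)
lam = 0.7  # arbitrary penalty weight

# Closed-form ridge estimate: (X^T X + lambda*I)^{-1} X^T y
p = X.shape[1]
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Cross-check: Ridge(alpha=lam) solves the same penalized least-squares problem.
sk = Ridge(alpha=lam, fit_intercept=False).fit(X, y)
print(np.allclose(beta_hat, sk.coef_))  # True
```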
A practical reason for squaring the L2 (that is not specific to ridge regression) is that "squaring" the L2 consists of not bothering to take the square root in the first place. And since $x^2$ is strictly increasing for non-negative $x$, $||f(\textbf{x})||_2$ and $||f(\textbf{x})||_2^2$ are optimal at the same point, so if the L2 is the optimization target itself (as opposed to a regularization penalty or something), it's a free speed gain. Squaring the L1 also doesn't change the optimal point, but it takes more time, so there's no reason to do it.
(If it is being used as a regularization penalty, we might still favor the faster option unless we have a specific reason to want the L2 exactly, rather than a monotone nonlinear transformation of it.)
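A small numeric sketch of that distinction, using a toy residual $f(x) = Ax - b$ and scipy's general-purpose minimizer (the values of $A$, $b$, and the penalty weight are arbitrary): the L2 norm and its square share a minimizer when they are the whole objective, but give different solutions when added as a penalty with a fixed weight.

```python
import numpy as np
from scipy.optimize import minimize

# Made-up least-squares residual f(x) = A x - b.
A = np.array([[2.0, 1.0], [1.0, 3.0], [0.5, -1.0]])
b = np.array([1.0, 2.0, 0.5])
f = lambda x: A @ x - b

# Same minimizer whether we minimize the L2 norm or its square
# (squaring is monotone on the non-negative reals).
x_norm = minimize(lambda x: np.linalg.norm(f(x)), x0=np.zeros(2)).x
x_sq = minimize(lambda x: np.linalg.norm(f(x)) ** 2, x0=np.zeros(2)).x
print(np.allclose(x_norm, x_sq, atol=1e-4))  # True, up to solver tolerance

# As an additive penalty with a fixed weight, the two choices differ.
lam = 1.0
pen_norm = minimize(lambda x: np.linalg.norm(f(x)) ** 2 + lam * np.linalg.norm(x),
                    x0=np.ones(2)).x
pen_sq = minimize(lambda x: np.linalg.norm(f(x)) ** 2 + lam * np.linalg.norm(x) ** 2,
                  x0=np.ones(2)).x
print(pen_norm, pen_sq)  # generally different points
```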
-
That is a good comment. The $\ell_p$ norm is defined as $$\ell_p = \left(\sum_{i=1}^n |x_i|^p\right)^{1/p}$$ but it can sometimes be easier to get rid of the $1/p$ power, which results in $$\ell_p^\prime = \sum_{i=1}^n |x_i|^p$$ In the case of $p=1$ nothing changes: $\ell_1^\prime = \ell_1$. But in the case of $p=2$ there is a difference: $\ell_2^\prime = (\ell_2)^2$. From the point of view of $\ell_p$ the ridge regression has this extra square step. From the point of view of $\ell_p^\prime$ there is no difference between Lasso and ridge regression. – Sextus Empiricus Nov 04 '22 at 19:31
In case this may help people, I guess one potential additional reason why we use the unsquared l1 norm is that the corresponding optimization algorithms are quite efficient: for instance, the proximal gradient descent algorithm just uses a simple soft-thresholding operator (the separable structure of the unsquared l1 norm also leads to even more efficient algorithms, see here: https://arxiv.org/pdf/2204.07826.pdf). But when using the squared l1 norm, the corresponding proximal gradient descent algorithm seems a bit more complex (cf. the proximal operator of the squared k-support norm here: https://arxiv.org/pdf/1204.5043.pdf, taking k=1).
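For illustration, here is a bare-bones proximal gradient (ISTA) sketch for the lasso objective $\tfrac{1}{2}\|y - Xb\|_2^2 + \lambda\|b\|_1$; the data, step-size rule, and iteration count are illustrative choices rather than anything from the papers above:

```python
import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t*||.||_1: elementwise soft-thresholding.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=1000):
    # Proximal gradient descent for 0.5*||y - X b||^2 + lam*||b||_1.
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2  # 1/L, L = Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)                # gradient of the smooth least-squares term
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Toy usage with made-up data; several coefficients come out exactly zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ np.r_[2.0, -1.0, np.zeros(8)] + rng.normal(size=200)
print(ista_lasso(X, y, lam=50.0))
```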