
I have read the most popular books in statistical learning

1- The elements of statistical learning.

2- An introduction to statistical learning.

Both mention that ridge regression has two formulas that are equivalent. Is there an understandable mathematical proof of this result?

I also went through Cross Validated, but I cannot find a definitive proof there.

Furthermore, does the LASSO admit the same type of proof?


jeza
  • https://en.wikipedia.org/wiki/Duality_(optimization)#The_strong_Lagrangian_principle:_Lagrange_duality – Taylor May 27 '18 at 18:56
  • Lasso is not a form of ridge regression. – Xi'an May 28 '18 at 17:10
  • @jeza, Could you explain what's missing in my answer? It really derives all that can be derived about the connection. – Royi May 30 '18 at 20:13
  • @jeza, Could you be specific? Unless you know the Lagrangian concept for constrained problems, it is hard to give a concise answer. – Royi May 31 '18 at 04:34
  • @jeza, a constrained optimization problem can be converted into optimization of the Lagrangian function / KKT conditions (as explained in the current answers). This principle already has many different simple explanations all over the internet. In what direction is more explanation of the proof necessary? Explanation/proof of the Lagrangian multiplier/function, explanation/proof of how this problem is a case of optimization that relates to the method of Lagrange, the difference between KKT/Lagrange, explanation of the principle of regularization, etc.? – Sextus Empiricus May 31 '18 at 07:05
  • Is this question a possible duplicate? https://math.stackexchange.com/questions/335306/why-are-additional-constraint-and-penalty-term-equivalent-in-ridge-regression – Sextus Empiricus May 31 '18 at 09:44
  • @jeza, I tried adding some intuitive explanation. Let me know what you think. – Royi May 31 '18 at 20:13

4 Answers


The classic Ridge Regression (Tikhonov Regularization) is given by:

$$ \arg \min_{x} \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} + \lambda {\left\| x \right\|}_{2}^{2} $$

The claim above is that the following problem is equivalent:

$$\begin{align*} \arg \min_{x} \quad & \frac{1}{2} {\left\| x - y \right\|}_{2}^{2} \\ \text{subject to} \quad & {\left\| x \right\|}_{2}^{2} \leq t \end{align*}$$

Let's define $ \hat{x} $ as the optimal solution of the first problem and $ \tilde{x} $ as the optimal solution of the second problem.

The claim of equivalence means that $ \forall t, \: \exists \lambda \geq 0 : \hat{x} = \tilde{x} $.
Namely, you can always find a pair of $ t $ and $ \lambda \geq 0 $ such that both problems have the same solution.

How can we find such a pair?
By solving the problems and looking at the properties of their solutions.
Both problems are convex and smooth, which makes things simpler.

The solution of the first problem is given at the point where the gradient vanishes:

$$ \hat{x} - y + 2 \lambda \hat{x} = 0 \quad \Rightarrow \quad \hat{x} = \frac{y}{1 + 2 \lambda} $$

The KKT conditions of the second problem are the stationarity condition:

$$ \tilde{x} - y + 2 \mu \tilde{x} = 0 $$

together with complementary slackness:

$$ \mu \left( {\left\| \tilde{x} \right\|}_{2}^{2} - t \right) = 0 $$

The last equation implies that either $ \mu = 0 $ or $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $.

Note that the two stationarity equations are identical in form.
Namely, if $ \mu = \lambda $, then $ \hat{x} $ and $ \tilde{x} $ satisfy the same equation, and since each problem is strictly convex, $ \hat{x} = \tilde{x} $.

So if $ {\left\| y \right\|}_{2}^{2} \leq t $, the unconstrained minimizer $ \tilde{x} = y $ is already feasible and one must set $ \mu = 0 $; that is, for $ t $ large enough, the two problems are equivalent only with $ \lambda = 0 $.

In the other case, one should find $ \mu $ such that

$$ {y}^{T} {\left( I + 2 \mu I \right)}^{-1} {\left( I + 2 \mu I \right)}^{-1} y = t $$

This is just the condition $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $ with $ \tilde{x} = {\left( I + 2 \mu I \right)}^{-1} y $; since the matrix is a scalar multiple of the identity, it reduces to $ \frac{{\left\| y \right\|}_{2}^{2}}{{\left( 1 + 2 \mu \right)}^{2}} = t $.

Once you find that $ \mu $, the solutions coincide.
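To make this concrete, here is a minimal numerical sketch in Python of the correspondence. The closed forms follow from the stationarity equations above; the variable names and the particular choice of $ t $ are mine, for illustration only:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.normal(size=5)
t = 0.5 * np.dot(y, y)                      # constraint level with ||y||^2 > t

# Constrained problem, solved numerically (SLSQP handles the inequality):
res = minimize(lambda x: 0.5 * np.sum((x - y) ** 2),
               x0=np.zeros_like(y),
               constraints=[{"type": "ineq", "fun": lambda x: t - np.sum(x ** 2)}])
x_constrained = res.x

# Multiplier from the active constraint ||y||^2 / (1 + 2 mu)^2 = t:
mu = 0.5 * (np.linalg.norm(y) / np.sqrt(t) - 1.0)

# Penalized (ridge) solution with lambda = mu; stationarity gives x = y / (1 + 2 lambda):
x_penalized = y / (1.0 + 2.0 * mu)

print(np.allclose(x_constrained, x_penalized, atol=1e-4))   # True: solutions coincide
```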

Regarding the $ {L}_{1} $ (LASSO) case, it works with the same idea.
The only difference is that in general there is no closed-form solution, so deriving the connection is trickier (but see the sketch below).

Have a look at my answers at StackExchange Cross Validated Q291962 and StackExchange Signal Processing Q21730 - Significance of $ \lambda $ in Basis Pursuit.
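For intuition, though, in the simple $ x - y $ model used above the penalized LASSO solution does have a closed form, namely soft thresholding (the proximal operator of the $ {L}_{1} $ norm), so the $ \lambda $ matching a given $ t $ can be recovered numerically. A sketch under that assumption (function names are mine):

```python
import numpy as np

def soft_threshold(y, lam):
    # Closed-form minimizer of 0.5 * ||x - y||^2 + lam * ||x||_1
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

rng = np.random.default_rng(1)
y = rng.normal(size=5)
t = 0.5 * np.sum(np.abs(y))            # constraint level with ||y||_1 > t

# ||x_hat(lam)||_1 decreases monotonically from ||y||_1 (at lam = 0) to 0
# (for lam >= max|y_i|), so a bisection on lam recovers the level t.
lo, hi = 0.0, np.max(np.abs(y))
for _ in range(60):
    lam = 0.5 * (lo + hi)
    if np.sum(np.abs(soft_threshold(y, lam))) > t:
        lo = lam
    else:
        hi = lam

print(np.sum(np.abs(soft_threshold(y, lam))), t)   # approximately equal
```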

Remark
What is actually happening?
In both problems, $ x $ tries to be as close as possible to $ y $.
In the first case, $ x = y $ makes the first term (the $ {L}_{2} $ distance) vanish; in the second case, it makes the whole objective function vanish.
The difference is that in the first case one must balance the fit against the $ {L}_{2} $ norm of $ x $: as $ \lambda $ gets higher, the balance pushes $ x $ to be smaller.
In the second case there is a wall: you bring $ x $ closer and closer to $ y $ until you hit the wall, namely the constraint on its norm (set by $ t $).
If the wall is far enough away (a high value of $ t $ relative to the norm of $ y $), it has no effect, just as $ \lambda $ is relevant only once its value is meaningful relative to the norm of $ y $.
The exact connection is given by the Lagrangian stated above.
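Note that in this simple model the connection is explicit, $ t \left( \lambda \right) = \frac{{\left\| y \right\|}_{2}^{2}}{{\left( 1 + 2 \lambda \right)}^{2}} $, which also shows that the mapping between $ \lambda $ and $ t $ is data dependent. A tiny sketch (the helper name is mine):

```python
import numpy as np

def t_of_lambda(y, lam):
    # Constraint level matched by a given lambda: t = ||y||^2 / (1 + 2 lam)^2
    return np.dot(y, y) / (1.0 + 2.0 * lam) ** 2

y1 = np.array([1.0, 2.0, 3.0])
y2 = 10.0 * y1
print(t_of_lambda(y1, 1.0), t_of_lambda(y2, 1.0))  # same lambda, different t
```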

Resources

I found this paper today (03/04/2019):

Royi
  • Does the equivalence mean that $ \lambda $ and $ t $ should be the same? Because I cannot see that in the proof. Thanks – jeza May 28 '18 at 03:04
  • @jeza, As I wrote above, for any $ t $ there is $ \lambda \geq 0 $ (not necessarily equal to $ t $, but a function of $ t $ and the data $ y $) such that the solutions of the two forms are the same. – Royi May 28 '18 at 04:07
  • @jeza, both $\lambda$ & $t$ are essentially free parameters here. Once you specify, say, $\lambda$, that yields a specific optimal solution. But $t$ remains a free parameter. So at this point the claim is that there can be some value of $t$ that would yield the same optimal solution. There are essentially no constraints on what that $t$ must be; it's not like it has to be some fixed function of $\lambda$, like $t=\lambda/2$ or something. – gung - Reinstate Monica May 29 '18 at 20:46
  • @Royi, I would like to know: 1- why does your formula have the (1/2), while the formulas in the question do not? 2- are you using KKT to show the equivalence of the two formulas? 3- if yes, I still can't see that equivalence. What I expect to see is a proof showing that formula one = formula two. – jeza Apr 01 '19 at 14:49
  • 1. It is just easier when you differentiate the LS term; you can move from my $ \lambda $ to the OP's $ \lambda $ by a factor of two. 2. I used KKT for the 2nd case. The first case has no constraints, hence you can just solve it. 3. There is no closed-form equation between them. I showed the logic and how you can create a graph connecting them. But as I wrote, it will change for each $ y $ (it is data dependent). – Royi Apr 01 '19 at 15:09
  • @Royi, I have a few things. 1- what do you mean by "move from my $ \lambda $ to the OP's $ \lambda $ by a factor of two"? It is not clear to me yet. 2- I still cannot understand that logic. 3- I am confused because you introduce $ \mu $ along with $ t $ and $ \lambda $; also the passages "in case $ {\left\| y \right\|}_{2}^{2} \leq t $ one must set $ \mu = 0 $, which means that for $ t $ large enough, in order for both to be equivalent one must set $ \lambda = 0 $" and "$ {y}^{T} {\left( I + 2 \mu I \right)}^{-1} {\left( I + 2 \mu I \right)}^{-1} y = t $. This is basically when $ {\left\| \tilde{x} \right\|}_{2}^{2} = t $" are not clear to me. 4- I expected to see that $ t $ = $ \lambda $, but then you mention $ \mu $ = $ \lambda $. Could you please help me by elaborating or editing the answer? Thx – jeza Apr 02 '19 at 23:17
  • @Jeza, Could you open new question, post a link to it here and I will try answering it? – Royi Apr 03 '19 at 02:29
  • @Royi, I think the question will be duplicated! – jeza Apr 04 '19 at 10:14
  • So I really don't understand the issue. I basically derived the KKT equations for the constrained case and tried to tie it to the regularized case. – Royi Apr 04 '19 at 10:26
  • @Royi, I do not understand the logic on how you go from constraint case to regularized one. – jeza Apr 04 '19 at 12:45
  • I ask which values of $ \lambda $ in the regularized case and $ t $ in the constrained case will yield the same answer. Again, you may open a question which asks for the function that maps one to the other. I will try to show it in code. – Royi Apr 04 '19 at 14:51
  • @Royi, I added a new question. – jeza Apr 04 '19 at 16:06
  • @jeza, Please paste a link here. – Royi Apr 04 '19 at 16:34
  • @Royi, https://stats.stackexchange.com/questions/401212/how-to-show-the-equivalence-between-the-regularized-regression-and-their-constra – jeza Apr 04 '19 at 16:40
  • @Royi, could you please have a look at the question? – jeza Apr 07 '19 at 23:56
  • @jeza, I promise I will. Just a tight days. Once I have a free hour I will do it. – Royi Apr 08 '19 at 06:01
  • @Royi, thanks, now there is a bounty on the question. – jeza Apr 08 '19 at 11:44
  • @jeza, I added my answer. – Royi Apr 13 '19 at 06:56
  • @Royi, many thanks, I think I still need more time to get the idea, I am learning. – jeza Apr 17 '19 at 17:05
  • Well, at least you chose a really interesting subject to learn :-). – Royi Apr 17 '19 at 21:12
  • The most confusing part is that the relation between $\lambda$ and $t$ is data-dependent. @Royi has mentioned this twice in the comments, but IMO this is an important point that deserves to be put in bold in the answer. Although this fact is somewhat obvious (for given $\lambda, t$ we scale $y$ away from the regression line, so that the solution penalised by $\lambda$ doesn't fit into the constraint set by $t$), the phrasing "equivalent way to write the problem" is misleading, as in order to write the problem in the equivalent way we are required to first solve it. – paperskilltrees May 10 '22 at 23:58
  • Note that HTF in their ESL book put it this way, without mentioning any dependence on data: "There is a one-to-one correspondence between the parameters λ in (3.41) and t in (3.42)". – paperskilltrees May 11 '22 at 00:02