
In every machine learning book it is mentioned, usually in passing, that the ridge regression problem $$p_1^* = \min\limits_{\beta} \ \left( \mathrm{RSS} + \lambda\sum_{j=1}^p \beta_j^2 \right)$$ is equivalent to $$ p_2^* = \min_{\beta} \left\{ \mathrm{RSS}\,\Bigg\vert\,\sum_{j=1}^p \beta_j^2 < t \right\} $$ for some $t$. I follow the classical notation: we have $p$ predictors in a linear regression model, $\beta_j$ are the coefficients we want to find, $\lambda$ is the ridge penalization parameter, and so on. First, I assume the equivalence is not between the objective values but between the minimizers.

I am now trying to understand this equivalence. Starting from the second formulation, the Lagrangian is $$L(\beta, \lambda) = \mathrm{RSS} + \lambda \left( \sum_{j=1}^p \beta_j^2 - t \right) $$ for $\lambda \geq 0$. We know that the dual function lower bounds the optimal value of the original problem, i.e., for each fixed $\lambda \geq 0$, $$\min_\beta L(\beta, \lambda) \leq p_2^*. $$

Since we are minimizing $L(\beta, \lambda)$ over $\beta$ with $\lambda$ fixed, the $-\lambda t$ term is a constant, so the minimizer is the same as that of $p_1^*$. In other words, the minimizer of the first problem minimizes the Lagrangian, whose optimal value only gives a lower bound on $p_2^*$. Still, I don't see how we can use these two formulations interchangeably.
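Making that step explicit (just the derivation in the notation above, with $p_1^*$ evaluated at this fixed $\lambda$): $$\min_\beta L(\beta, \lambda) = \min_\beta \left( \mathrm{RSS} + \lambda \sum_{j=1}^p \beta_j^2 \right) - \lambda t = p_1^* - \lambda t \leq p_2^*,$$ and the minimizing $\beta$ is the same on both sides of the first equality because the shift $-\lambda t$ does not depend on $\beta$.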

independentvariable

1 Answer


I'm going to assume that the inequality in the second formulation is weak ($\le$), and that $\lambda$ is fixed. In terms of "equivalence", it is correct to say that the two formulations have the same solution when $t=\sum_{j=1}^p \overline{\beta}_j^2$, where $\overline{\beta}$ is the optimal solution to the first formulation. That observation is rather trivial: if $\overline{\beta}$ were not optimal in the second formulation with that choice of $t$, there would exist a feasible $\beta$ with $\sum_{j=1}^p \beta_j^2 \le t$ and a strictly smaller RSS; that $\beta$ would make the first objective strictly smaller (smaller RSS and a penalty term no larger than $\lambda t$), contradicting the optimality of $\overline{\beta}$ there.
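For what it's worth, here is a minimal numerical sketch of that correspondence (assumptions: simulated data `X`, `y` with no intercept, the closed-form ridge solution, and `scipy` for the constrained problem; none of this is from the argument above): solve the penalized problem for a fixed $\lambda$, set $t$ equal to the squared norm of that solution, and check that the constrained problem returns the same coefficients.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical simulated data: n observations, p predictors, no intercept.
rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

lam = 2.0  # fixed ridge penalty lambda

# First formulation: closed-form minimizer of RSS + lam * sum(beta_j^2).
beta_pen = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Choose t as the squared norm of that solution.
t = np.sum(beta_pen ** 2)

def rss(b):
    return np.sum((y - X @ b) ** 2)

# Second formulation: minimize RSS subject to sum(beta_j^2) <= t.
constraint = {"type": "ineq", "fun": lambda b: t - np.sum(b ** 2)}
res = minimize(rss, x0=np.zeros(p), constraints=[constraint],
               method="SLSQP", options={"ftol": 1e-12, "maxiter": 1000})

print(np.allclose(beta_pen, res.x, atol=1e-4))  # expected: True
```

Note that with this choice of $t$ the constraint is active at $\overline{\beta}$ ($\sum_j \overline{\beta}_j^2 = t$ exactly), which is why the weak inequality is assumed above: under the strict inequality from the question, $\overline{\beta}$ would not even be feasible.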

prubin