
My understanding of LASSO regression is that the regression coefficients are selected to solve the minimisation problem:

$$\min_\beta \|y - X \beta\|_2^2 \quad \text{s.t.} \quad \|\beta\|_1 \leq t$$

In practice this is done using a Lagrange multiplier, making the problem to solve

$$\min_\beta \|y - X \beta\|_2^2 + \lambda \|\beta\|_1 $$

What is the relationship between $\lambda$ and $t$? Wikipedia unhelpfully states only that it is "data dependent".
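To make the mapping concrete: for any given $\lambda$ I can fit the lasso and read off the implied $t$ as the $\ell_1$ norm of the fitted coefficients. A minimal sketch, assuming scikit-learn's `Lasso` (whose `alpha` corresponds to $\lambda$ up to the $1/(2n)$ scaling scikit-learn applies to the squared-error term) and synthetic data:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data; any (X, y) would do.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.array([3.0, -2.0, 1.5, 0, 0, 0, 0, 0, 0, 0])
y = X @ beta_true + rng.normal(size=100)

# scikit-learn's Lasso minimises (1/(2n)) * ||y - Xb||^2 + alpha * ||b||_1,
# so alpha plays the role of lambda up to a constant rescaling.
model = Lasso(alpha=0.1).fit(X, y)

# The implied constraint level t is the L1 norm of the fitted coefficients.
t_implied = np.abs(model.coef_).sum()
print(f"implied t = {t_implied:.3f}")
```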

Why do I care? Firstly for intellectual curiosity. But I am also concerned about the consequences for selecting $\lambda$ by cross-validation.

Specifically, if I'm doing $n$-fold cross-validation, I fit $n$ different models to $n$ different subsets of my training data and then compare the accuracy of each model on the held-out data for a given $\lambda$. But the same $\lambda$ implies a different constraint ($t$) for different subsets of the data (i.e., $t = f(\lambda)$ is "data dependent").

Isn't the cross-validation problem I really want to solve to find the $t$ that gives the best bias-variance trade-off?

I can get a rough idea of the size of this effect in practice by calculating $\|\beta\|_1$ for each cross-validation split and each $\lambda$ and looking at the resulting distribution. In some cases the implied constraint ($t$) can vary quite substantially across my cross-validation subsets; by "substantially" I mean the coefficient of variation of $t$ is $\gg 0$.
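For instance, here is a sketch of that diagnostic, again assuming scikit-learn (`Lasso` plus `KFold`) and synthetic data; for each candidate penalty it collects the implied $t$ on each training fold and reports the coefficient of variation:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X @ np.concatenate([[3.0, -2.0, 1.5], np.zeros(7)]) + rng.normal(size=200)

for alpha in [0.01, 0.1, 0.5]:
    # t implied by this penalty level on each training fold
    ts = []
    for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=1).split(X):
        fit = Lasso(alpha=alpha).fit(X[train_idx], y[train_idx])
        ts.append(np.abs(fit.coef_).sum())
    ts = np.array(ts)
    print(f"alpha={alpha}: mean t = {ts.mean():.3f}, "
          f"coef. of variation = {ts.std() / ts.mean():.3f}")
```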

Ferdi
    Upvoting to cancel out the unexplained downvote. The question is well outside my expertise but it seems reasonably formulated. – mkt Dec 22 '17 at 10:20

2 Answers


The lasso has no closed-form solution, but the analogous ridge problem does, which makes the relationship between $\lambda$ and $t$ easier to see. This is the standard solution for ridge regression:

$$ \beta = \left( X'X + \lambda I \right) ^{-1} X'y $$

We also know that $\| \beta \| = t$ at the solution whenever the constraint is binding, so in that case it must be true that

$$ \| \left( X'X + \lambda I \right) ^{-1} X'y \| = t, $$

which is possible, but not easy, to solve for $\lambda$.

Your best bet is to just keep doing what you're doing: compute $t$ on the same sub-sample of the data across multiple $\lambda$ values.
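That said, if you do want the $\lambda$ that matches a particular $t$ in the ridge case, a one-dimensional root-finder is enough, since the coefficient norm decreases monotonically in $\lambda$. A minimal sketch, assuming `scipy.optimize.brentq` and synthetic data:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, -1.0, 0.5, 0.0, 0.0]) + rng.normal(size=100)
p = X.shape[1]

def beta_norm(lam):
    # ||(X'X + lam * I)^{-1} X'y||_2, which shrinks monotonically as lam grows
    beta = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    return np.linalg.norm(beta)

t_target = 1.0  # the constraint level we want to match (hypothetical choice)
# beta_norm is decreasing in lam, so bracket the root between a tiny and a
# huge penalty and let brentq find the lambda with ||beta|| = t_target.
lam = brentq(lambda l: beta_norm(l) - t_target, 1e-8, 1e8)
print(f"lambda = {lam:.4f}, implied ||beta|| = {beta_norm(lam):.4f}")
```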

shadowtalker
  • Can you explain the intuition for why $||\beta|| = t$? I would think that this would only be true if the minima for the $RSS^{OLS}$ solution falls outside of the L1-ball given a specific $\lambda$. Otherwise, wouldn't $||\beta|| \le t$ ? – Jacob Bumgarner Nov 17 '23 at 18:05

This question relates to "Is the magnitude coefficient vector in Ridge regression monotonic in lambda?", which sketches the situation for ridge regression; it is similar for the lasso.

[Figure: the relation between the optimal RSS and $\|\beta\|_1$]

Consider the optimal RSS as a function of the constraint level $t = \|\beta\|_1$. Say that this function is $RSS = f(t)$.

The goal of the lasso is to find the $\beta$ which minimizes $$\text{Cost}(\beta) = RSS(\beta) + \lambda \|\beta\|_1$$

We can equally describe the cost as a function of the magnitude of the coefficients, $t$:

$$\text{Cost}(t) = f(t) + \lambda t$$

this is minimized when

$$\frac{\partial}{\partial t} \text{Cost}(t) = \frac{\partial}{\partial t} f(t) + \lambda = 0 $$

And the relationship between $\lambda$ and $t$ is

$$\lambda = - \frac{\partial}{\partial t} f(t)$$

This function $f(t)$, the size of the RSS for a given size of the coefficient estimates, depends on the data, which is why the relationship between $\lambda$ and $t$ is "data dependent".
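The relationship $\lambda = -f'(t)$ can be checked numerically along the lasso path. A minimal sketch assuming scikit-learn's `lasso_path` on synthetic data; note that scikit-learn scales the squared-error term by $1/(2n)$, so its `alpha` corresponds to $\lambda / (2n)$ in the notation above:

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 10))
y = X @ np.concatenate([[3.0, -2.0, 1.5], np.zeros(7)]) + rng.normal(size=n)

# Coefficients along a grid of penalties (alphas come back in decreasing order)
alphas, coefs, _ = lasso_path(X, y)
ts = np.abs(coefs).sum(axis=0)                     # t = ||beta||_1 at each alpha
rss = ((y[:, None] - X @ coefs) ** 2).sum(axis=0)  # f(t) = RSS at each alpha

# Finite-difference slope -f'(t) should track lambda = 2 * n * alpha
slope = -np.diff(rss) / np.diff(ts)
lam = 2 * n * (alphas[:-1] + alphas[1:]) / 2
for s, l in list(zip(slope, lam))[::10]:
    print(f"-dRSS/dt = {s:10.2f}   lambda = {l:10.2f}")
```

Between neighbouring points on the path the slope agrees with $\lambda$ to finite-difference accuracy, and rerunning with different data changes the $t$ obtained at each $\lambda$, illustrating the data dependence above.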