
$\newcommand{\diag}{\operatorname{diag}}$We have the problem: $$\min_{w\in\mathbb{R}^{d}}\left( \frac{1}{n}\sum_{i=1}^{n} \left( \langle w,x_{i}\rangle-y_{i} \right)^{2} +2\lambda||w||_1\right),$$ with the assumption that: $$\sum_{i=1}^nx_ix_i^T=\diag(\sigma_1^2,...,\sigma_d^2).$$

Is there a closed-form solution in this case?

I have that: $$(X^TX)^{-1}=\diag\left(\sigma_1^{-2},...,\sigma_d^{-2}\right),$$ and so I think the answer is: $$w\,^j=y\,^j\max\left\{0,1-\lambda \frac{n}{|y^j|}\right\},$$ for $y\,^j=\displaystyle\sum_{i=1}^n\frac{y_ix_i\,^j}{\sigma_i^2}$ , but I'm not sure.


1 Answer


I'm going to go through @cardinal's derivation of the closed-form lasso solution when $X^T X = I$, found here, with minor modifications.

I'll be assuming that $\sigma^2_i > 0$ for all $i$. This is justified because if we had some $\sigma^2_i = 0$, it would mean that the $i$th column of $X$ is all zeros, and I think it's reasonable to exclude such a case. I'll let $X^T X = D = \operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2)$. Note that this also means that $X$ is full rank and the OLS solution $\hat \beta$ is uniquely defined.

I'm also going to modify your notation to better match that in the answer that I'm referencing. To that end, I shall be solving $$ \hat \beta_\lambda = \text{argmin}_{\beta \in \mathbb R^p } \frac 12 \vert \vert Y - X\beta\vert \vert^2_2 + \lambda \vert \vert \beta \vert \vert_1. $$

This is equivalent to your problem up to a rescaling of the objective and of $\lambda$ (spelled out just below), but I can add more details here if you'd like.
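In particular, multiplying your objective by $n/2$ doesn't change the minimizer, and $$ \frac n2 \left( \frac 1n \vert \vert Y - Xw \vert \vert^2_2 + 2\lambda \vert \vert w \vert \vert_1 \right) = \frac 12 \vert \vert Y - Xw \vert \vert^2_2 + n\lambda \vert \vert w \vert \vert_1, $$ so the $\lambda$ I use below plays the role of $n\lambda$ in your formulation, which is presumably where the factor of $n$ in your guess comes from.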

Following @cardinal's derivation, we have that we need to solve $$ \hat \beta_\lambda = \text{argmin } \frac 12 (Y^T Y - 2 Y^T X \beta + \beta^T X^T X \beta) + \lambda \vert \vert \beta \vert \vert_1 $$

$$ = \text{argmin } -Y^T X \beta + \frac 12 \beta^T D \beta + \lambda \vert \vert \beta \vert \vert_1, $$

where I've dropped the constant term $\frac 12 Y^T Y$ since it doesn't affect the argmin.

Noting that the OLS solution is $\hat \beta = (X^T X)^{-1} X^T Y = D^{-1}X^T Y$, we have that $$ \hat \beta_\lambda = \text{argmin } -\hat \beta^T D \beta + \frac 12 \beta^T D \beta + \lambda \vert \vert \beta \vert \vert_1 $$

$$ = \text{argmin } \sum_{j=1}^p -\hat \beta_j \beta_j \sigma^2_j + \frac{\sigma^2_j}2 \beta_j^2 + \lambda | \beta_j|. $$

We are optimizing over each $\beta_j$ separately, so we can solve each term of this sum separately. This means we need to minimize $\mathcal L_j$ where $$ \mathcal L_j = -\hat \beta_j \beta_j \sigma^2_j + \frac{\sigma^2_j}2 \beta_j^2 + \lambda | \beta_j|. $$

Following a completely analogous argument to the linked answer, we find that $$ (\hat \beta_\lambda)_j = \mathrm{sgn}(\hat \beta_j) \left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+. $$
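Briefly, since $\mathcal L_j$ is convex, $\beta_j$ minimizes it exactly when $0$ lies in its subdifferential, i.e. $$ 0 \in -\hat \beta_j \sigma^2_j + \sigma^2_j \beta_j + \lambda \, \partial \vert \beta_j \vert. $$ If $\beta_j > 0$ the subdifferential of $\vert \beta_j \vert$ is $\{1\}$ and we get $\beta_j = \hat \beta_j - \lambda / \sigma^2_j$, which is only consistent with $\beta_j > 0$ when $\hat \beta_j > \lambda / \sigma^2_j$; the mirror-image statement holds for $\beta_j < 0$; and when neither is possible, i.e. $\vert \hat \beta_j \vert \leq \lambda / \sigma^2_j$, the only option left is $\beta_j = 0$ (indeed $0 \in -\hat \beta_j \sigma^2_j + \lambda [-1, 1]$ exactly in that case). The $(\cdot)^+$ is just a compact way of writing these three cases at once.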

Furthermore, $\hat \beta = D^{-1} X^T Y \implies \hat \beta_j = \frac{X_j^T Y}{\sigma_j^2}$ so we have that $$ \left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+ = \frac 1 {\sigma^2_j} \left(|X_j^T Y| - \lambda\right)^+ $$

So it turns out that a predictor $X_j$ gets zeroed out exactly when it would be if the design matrix were orthonormal, not just orthogonal. In this case with $X^T X = D \neq I$, the variable selection is no different than if $X^T X = I$, but the actual coefficients $\hat \beta_\lambda$ are scaled according to the predictor variances.

As a final note, I will turn this solution into one that resembles yours which means we need to multiply $\hat \beta$ by something to get $\hat \beta_\lambda$. If $(\hat \beta_\lambda)_j \neq 0$ then we have that $$ (\hat \beta_\lambda)_j = \text{sgn}(\hat \beta_j) \left( \vert \hat \beta_j \vert - \frac{\lambda}{\sigma^2_j} \right) = \hat \beta_j - \text{sgn}(\hat \beta_j) \frac{\lambda}{\sigma^2_j} $$

$$ = \hat \beta_j \left( 1 - \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \right) $$

since $\frac{a}{\vert a \vert} = \text{sgn}(a)$.

Noting that $(\hat \beta_\lambda)_j = 0$ exactly when $$ \vert \hat \beta_j \vert - \frac{\lambda}{\sigma^2_j} \leq 0 \iff \vert \hat \beta_j \vert \leq \frac{\lambda}{\sigma^2_j} \iff 1 \leq \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \iff 1 - \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \leq 0, $$

we see that we could alternatively express $\hat \beta_\lambda$ as $$ (\hat \beta_\lambda)_j = \hat \beta_j \left( 1 - \frac{\lambda}{\sigma^2_j \vert \hat \beta_j \vert} \right)^+. $$

So this is very close to what you had but not exactly the same.
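If I've tracked your scaling correctly (your $2\lambda$ penalty corresponds to $n\lambda$ here, as noted above), then in your notation this reads $$ w^j = \hat \beta_j \left( 1 - \frac{n \lambda}{\sigma^2_j \vert \hat \beta_j \vert} \right)^+, \qquad \hat \beta_j = \frac{1}{\sigma^2_j} \sum_{i=1}^n y_i x_i^j, $$ so the main difference from your guess is the extra $\sigma^2_j$ in the denominator of the threshold.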

I always like to check derivations like this against well-known libraries if possible, so here's an example in R:

## generating `x`
set.seed(1)
n = 1000
p = 5
sigma2s = 1:p
## the columns of svd(...)$u are orthonormal, so t(x) %*% x = diag(sigma2s)
x = svd(matrix(rnorm(n * p), n, p))$u %*% diag(sqrt(sigma2s))

## sanity check: this should be (numerically) diag(sigma2s)
# t(x) %*% x

## generating `y`
betas = 1:p
y = x %*% betas + rnorm(nrow(x), 0, .5)

lambda = 2

## using a well-known library to fit lasso
library(penalized)
penalized(y, x, lambda1 = lambda)@penalized


## using closed form solution
betahat = lm(y ~ x - 1)$coef  # OLS with no intercept, i.e. solve(t(x) %*% x) %*% t(x) %*% y
sign(betahat) * pmax(abs(betahat) - lambda / sigma2s, 0)  # soft-threshold coordinate j at lambda / sigma2s[j]
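
If you prefer glmnet as the reference implementation, keep in mind that glmnet minimizes $\frac{1}{2n} \vert \vert y - X\beta \vert \vert^2_2 + \lambda_g \vert \vert \beta \vert \vert_1$ for the Gaussian family with $\alpha = 1$, so the $\lambda$ used above corresponds to $\lambda_g = \lambda / n$. Something along these lines should reproduce the same coefficients, up to glmnet's convergence tolerance:

## optional cross-check with glmnet, whose lambda sits on a 1/(2n)-scaled loss
library(glmnet)
fit = glmnet(x, drop(y), alpha = 1, lambda = lambda / n,
             standardize = FALSE, intercept = FALSE)
as.matrix(coef(fit))[-1, ]  # drop the (zero) intercept row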
  • Can you please elaborate why is $(\hat \beta_\lambda)_j = \mathrm{sgn}(\hat \beta_j) \left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+$? I don't understand the $+$ of $\left(|\hat \beta_j| - \frac{\lambda}{\sigma^2_j}\right)^+$. Can't see why this is true (even in the linked answer). – Dennis Jun 06 '21 at 20:37
  • @Dennis I'll focus on the linked answer. Does it make sense why $\text{sgn}(\hat\beta_j) = \text{sgn}(\beta_j)$? Once we know they have the same signs, then we can go case by case. The first case is $\hat\beta_j > 0$ and in this case we know $\beta_j \geq 0$. We can then show that $\beta_j = \hat\beta_j - \gamma$, but since we also know that $\beta_j$ is non-negative, if $\hat\beta_j - \gamma < 0$ then this isn't feasible so we get the closest we can to it, which is just zero. $(\hat\beta_j - \gamma)^+$ is a shorthand for this. Then we do the same for the negative case – jld Jun 08 '21 at 13:58
  • Thank you for the help. The only part that confuses me is "so we get the closest we can to it" that you said. Are you sure this is the right explanation? It lacks some formality, as far as I understand. Could you elaborate on that part, please? – Dennis Jun 13 '21 at 10:22