
Let $J(w)$ be some cost function. By adding $L^1$ regularization we get $$ \tilde{J}(w) = J(w) + \beta\sum_i|w_i|. $$ To study the effect of $L^1$ regularization on the optimal weights, we can approximate $J(w)$ with a quadratic function: $$ \hat{J}(w) = J(w^*) + \frac{1}{2}(w-w^*)^T H(w^*)(w-w^*), $$ where $w^*$ is the minimum of $J$, so the gradient vanishes there. According to the Deep Learning book I'm reading (see pages 207-208), if we assume that $H = \operatorname{diag}(\gamma_1,\ldots,\gamma_N)$ with each $\gamma_i>0$, then

the solution of the minimum of the $L^1$ regularized objective function decomposes into a system of equations of the form $$ \tilde{J}(w) = \frac{1}{2}\gamma_i(w_i-w_i^*)^2 + \beta|w_i|, $$ which admits an optimal solution (for each dimension $i$) of the following form: $$ w_i = \operatorname{sign}(w_i^*)\max\left(|w_i^*|-\frac{\beta}{\gamma_i},\,0\right). $$
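
For anyone who wants to sanity-check the quoted closed form numerically, here is a small sketch (my own, not from the book); the values of `w_star`, `gamma` and `beta` below are arbitrary placeholders. It brute-forces the per-dimension objective on a grid and compares the grid minimizer with the soft-thresholding expression.

```python
import numpy as np

def objective(w, w_star, gamma, beta):
    # Per-dimension L1-regularized quadratic: 0.5*gamma*(w - w*)^2 + beta*|w|
    return 0.5 * gamma * (w - w_star) ** 2 + beta * np.abs(w)

def soft_threshold(w_star, gamma, beta):
    # Quoted closed form: shrink w* toward zero by beta/gamma, clipping at zero
    return np.sign(w_star) * max(abs(w_star) - beta / gamma, 0.0)

grid = np.linspace(-10, 10, 2_000_001)  # fine grid for brute-force search
for w_star, gamma, beta in [(0.5, 1.0, 1.0), (2.0, 1.0, 1.0), (-3.0, 2.0, 1.0)]:
    w_numeric = grid[np.argmin(objective(grid, w_star, gamma, beta))]
    w_closed = soft_threshold(w_star, gamma, beta)
    print(f"w*={w_star:+.1f}  grid minimizer={w_numeric:+.4f}  closed form={w_closed:+.4f}")
```

In the first case ($w_i^* = 0.5$, $\beta = \gamma_i = 1$) both agree that the minimizer is $0$, i.e. the weight is clipped to zero rather than merely shifted.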

I don't understand the quoted part. This is my starting point: $$ \tilde{J}(w) = J(w^*) + \frac{1}{2} \sum_i \gamma_i (w_i - w^*_i)^2 + \beta \sum_i |w_i|. $$ Shouldn't we look for points where the gradient is zero? From $$ \frac{\partial\tilde{J}(w)}{\partial w_i} = \gamma_i (w_i - w^*_i) + \beta \operatorname{sign}(w_i) = 0, $$ I get $$ w_i = w^*_i - \frac{\beta}{\gamma_i}\operatorname{sign}(w_i). $$ What am I doing wrong?

Kiuhnm
  • 1) In the last two lines, $\text{sign}(w_i)$ should be $\text{sign}(w_i^*)$. 2) Consider the situation where $w_i^* = 0.5$, $\beta = \gamma_i = 1$. $w_i = -0.5$ in your formulation, but $0$ in the quote. This is because the derivative changes abruptly (loosely speaking) at $0$. You can see by looking at the original expression that $w_i$ shouldn't get more negative once it has reached zero (coming from the positive direction), as that would increase both terms in $\tilde{J}(w)$. See also http://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection/74569#74569.
  • – jbowman Nov 19 '15 at 20:53
  • @jbowman OK. I found the minima of $\tilde{J}_i$ for $w_i<0$, $w_i>0$ and $w_i=0$. Ignoring $w_i=0$, I get the minimum at $w_i=w_i^*-\operatorname{sign}(w_i^*)\frac{\beta}{\gamma_i}$. The minimum is at $0$ if $|w_i^*|\leq\frac{\beta}{\gamma_i}$. Putting it all together, after a bit of manipulation I get the correct formula! Thank you both! – Kiuhnm Nov 20 '15 at 02:19
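
For completeness, here is a sketch of the case analysis from the last comment (my own write-up of the same steps, not a quote). Fix one coordinate and minimize $$ f(w_i) = \frac{1}{2}\gamma_i(w_i-w_i^*)^2 + \beta|w_i|, $$ assuming $w_i^*>0$ (the case $w_i^*<0$ is symmetric). On $w_i>0$ the stationary point is $w_i = w_i^*-\frac{\beta}{\gamma_i}$, which is admissible only when $w_i^*>\frac{\beta}{\gamma_i}$. On $w_i<0$ the stationarity condition gives $w_i = w_i^*+\frac{\beta}{\gamma_i}>0$, which lies outside that region, so there is no interior minimum for negative $w_i$. If instead $w_i^*\leq\frac{\beta}{\gamma_i}$, the one-sided derivatives at $0$ are $-\gamma_i w_i^*-\beta<0$ (from the left) and $-\gamma_i w_i^*+\beta\geq 0$ (from the right), so $f$ decreases up to $0$ and does not decrease beyond it, and the minimum is at $w_i=0$. Combining the two cases gives $$ w_i = \operatorname{sign}(w_i^*)\max\left(|w_i^*|-\frac{\beta}{\gamma_i},\,0\right), $$ exactly the formula in the quote.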