For a training problem with loss function $L(w) = \frac{1}{N}\sum_{i=1}^N l(w, x_i, y_i) + \lambda \|w\|^2_2$, where $l(w, x_i, y_i)$ is something like a least-squares loss and the global minimum of $L(w)$ can always be found, how can I show that decreasing $\lambda$ always decreases the minimised value of $L(w)$?
1 Answer
Suppose $w'$ is the optimal $w$ for the smaller $\lambda$ (call it $\lambda'$) and $w$ is the optimal $w$ for the larger $\lambda$; write $L(w,\lambda)$ for the penalised loss evaluated at a given $w$ and $\lambda$.
We know $L(w',\lambda')\leq L(w,\lambda')$ because $w'$ is optimal for $\lambda'$ (by definition), and $L(w,\lambda')\leq L(w,\lambda)$ because, for any fixed $w$, $L$ is non-decreasing in $\lambda$. Chaining the two inequalities gives $L(w',\lambda')\leq L(w,\lambda)$: the minimised loss at the smaller $\lambda'$ is no larger than the minimised loss at the larger $\lambda$, which is what you wanted to show.
In fact the result is stronger than you asked for: decreasing $\lambda$ decreases not just the penalised loss $L(w)$ but also the unpenalised loss $\frac{1}{N}\sum_i l(w; x_i,y_i)$. This uses both optimality conditions, as spelled out below.
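Writing $\hat L(w)$ for the unpenalised loss $\frac{1}{N}\sum_i l(w; x_i,y_i)$ (a shorthand introduced only for this argument), optimality of $w'$ at $\lambda'$ and of $w$ at $\lambda$ give
$$
\hat L(w') + \lambda'\|w'\|_2^2 \le \hat L(w) + \lambda'\|w\|_2^2,
\qquad
\hat L(w) + \lambda\|w\|_2^2 \le \hat L(w') + \lambda\|w'\|_2^2 .
$$
Adding the two and cancelling the $\hat L$ terms leaves $(\lambda-\lambda')\left(\|w\|_2^2-\|w'\|_2^2\right)\le 0$, and since $\lambda>\lambda'$ this forces $\|w'\|_2^2\ge\|w\|_2^2$. Substituting back into the first inequality then gives $\hat L(w')\le \hat L(w)$.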
Another way to see this is that minimising $\frac{1}{N}\sum_i l(w; x_i,y_i)+\lambda\|w\|_2^2$ is equivalent to minimising $\frac{1}{N}\sum_i l(w; x_i,y_i)$ subject to a constraint on $\|w\|^2_2$, and that a smaller $\lambda$ corresponds to a looser constraint. Minimising over a larger feasible set can only give a minimum that is as low or lower; see the display below.
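Concretely (in my notation, with $w_\lambda$ denoting the minimiser at penalty $\lambda$), the constrained form is
$$
\min_{w}\ \frac{1}{N}\sum_i l(w; x_i,y_i)
\quad\text{subject to}\quad
\|w\|_2^2 \le \|w_\lambda\|_2^2 ,
$$
and the bound $\|w_\lambda\|_2^2$ is non-increasing in $\lambda$ (that is the norm comparison derived above), so decreasing $\lambda$ enlarges the feasible set.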
Unfortunately, this argument covers only the in-sample loss (or 'apparent loss'). The out-of-sample loss (which is usually what you care most about) can go up or down as $\lambda$ decreases.
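If you want to check this numerically, here is a minimal sketch (my own illustration, not part of the argument above) using ridge regression, where the minimiser has a closed form; the synthetic data, seed, and $\lambda$ grid are arbitrary choices:

```python
# Ridge regression demo: as lambda decreases, the minimised in-sample loss
# (penalised and unpenalised) never increases; the out-of-sample loss may.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 30, 1000, 10
w_true = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + rng.normal(scale=2.0, size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + rng.normal(scale=2.0, size=n_test)

def ridge_fit(X, y, lam):
    """Minimise (1/N) * sum_i (x_i^T w - y_i)^2 + lam * ||w||_2^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def mse(X, y, w):
    return np.mean((X @ w - y) ** 2)

print(f"{'lambda':>8} {'penalised train':>16} {'train mse':>10} {'test mse':>10}")
for lam in [10.0, 3.0, 1.0, 0.3, 0.1, 0.03, 0.01]:  # decreasing lambda
    w_hat = ridge_fit(X_train, y_train, lam)
    train = mse(X_train, y_train, w_hat)
    penalised = train + lam * np.sum(w_hat ** 2)
    print(f"{lam:8.2f} {penalised:16.4f} {train:10.4f} {mse(X_test, y_test, w_hat):10.4f}")

# The two training columns decrease monotonically as lambda shrinks; the test
# column is not guaranteed to be monotone (it often has an interior minimum).
```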
- I think that's a very nice answer. Just a small comment: what you call "in-sample loss" (resp. "out-of-sample loss") is what is usually called training error (resp. test error), am I right? – Camille Gontier May 28 '20 at 08:59
- If we know that those first two inequalities are true, does that prove that $L(w', \lambda') < L(w, \lambda)$? And how do we know that $L(w, \lambda') < L(w, \lambda)$? – steven-seagull May 28 '20 at 21:11
- @CamilleGontier Yes. Or apparent error and actual error. – Thomas Lumley May 29 '20 at 03:34
- @steven-seagull Yes. And we know the second inequality because the only thing that is different on the two sides is $\lambda$, and it's bigger on the right. – Thomas Lumley May 29 '20 at 03:36