
Can anyone tell me what the effects of $L_2$ loss and smooth $L_1$ loss (i.e. Huber loss with $\alpha = 1$) are, and when to use each of them?


1 Answer


First, Huber loss only works in one dimension, because it requires $$\left\|\boldsymbol{a}\right\|_2=\left\|\boldsymbol{a}\right\|_1=\delta$$ at the intersection of its two pieces, which holds only in one dimension. The $L_2$ and $L_1$ norms, by contrast, are defined for vectors. Therefore, in my opinion, Huber loss is better compared with squared loss than with $L_2$ loss, since "$L_2$" suggests a multi-dimensional input whereas "squared" does not.

Huber loss is the same as squared loss for differences smaller than $\delta$, and grows linearly, like absolute loss, for differences larger than $\delta$, i.e. $$\begin{align*} L_{\delta}(y_n, f_{\theta}(\boldsymbol{x}_n)) =\left\{ \begin{matrix} \frac{1}{2}\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right)^2 & \left|y_n - f_{\theta}(\boldsymbol{x}_n)\right| \leq \delta,\\ \delta\left|y_n - f_{\theta}(\boldsymbol{x}_n)\right| - \frac{1}{2}\delta^2 & \text{otherwise,} \end{matrix} \right. \end{align*}$$

where $y_n$ is the target of data point $n$, and $f_{\theta}(\boldsymbol{x}_n)$ is the model's prediction. Note that $L_{\delta}$ has nothing to do with the $L_p$ norm, despite the similar notation.
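
For concreteness, here is a minimal NumPy sketch of this piecewise definition (the function name `huber_loss` and the vectorized form are my own, not taken from any library):

```python
import numpy as np

def huber_loss(y, y_pred, delta=1.0):
    # Element-wise Huber loss between targets y and predictions y_pred.
    diff = y - y_pred
    abs_diff = np.abs(diff)
    quadratic = 0.5 * diff ** 2                    # used where |diff| <= delta
    linear = delta * abs_diff - 0.5 * delta ** 2   # used where |diff| >  delta
    return np.where(abs_diff <= delta, quadratic, linear)

# Small residuals are penalized quadratically, large ones only linearly.
print(huber_loss(np.array([0.0, 0.0]), np.array([0.5, 5.0])))  # [0.125 4.5]
```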

Because of this definition, for large differences caused by outliers the gradient of the loss function remains constant at $\pm\delta$ (times the gradient of the model), the same as for absolute loss, i.e. $$\frac{\partial\, \delta\left|y_n - f_{\theta}(\boldsymbol{x}_n)\right|}{\partial \theta_i} = \pm \delta \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i},$$ whereas for squared loss the gradient grows with the difference, i.e. $$\frac{\partial\, \frac{1}{2}\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right)^2}{\partial \theta_i} = -\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right)\frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i},$$

which leads to large contributions from outliers when we update a parameter based solely on squared loss, as follows: $$\begin{align*} {\theta}'_i &=\theta_i + \lambda \sum_n \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i}\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right) \\ &= \theta_i + \lambda\sum_{n \notin \text{outliers}} \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i}(\text{small}) +\lambda\sum_{n \in \text{outliers}} \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i}(\text{large}) \end{align*}$$
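
To make the contrast concrete, here is a small sketch of the two gradients with respect to the prediction, as a function of the residual $y_n - f_{\theta}(\boldsymbol{x}_n)$ (the function names are my own); the Huber gradient is simply a clipped version of the squared-loss gradient:

```python
import numpy as np

def squared_grad(residual):
    # Gradient of 0.5 * (y - f)^2 with respect to the prediction f:
    # -(y - f), so it grows without bound as the residual grows.
    return -residual

def huber_grad(residual, delta=1.0):
    # Gradient of the Huber loss with respect to f: identical to the squared-loss
    # gradient inside [-delta, delta], saturated at +/- delta outside.
    return -np.clip(residual, -delta, delta)

residuals = np.array([0.1, 0.5, 2.0, 10.0, 100.0])
print(squared_grad(residuals))           # [  -0.1   -0.5   -2.  -10. -100.]
print(huber_grad(residuals, delta=1.0))  # [  -0.1   -0.5   -1.   -1.   -1.]
```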

It is worth noting that, here, outliers are irregularities in the joint input-output space $(\boldsymbol{x}_n, y_n)$, not necessarily just in the input space $\boldsymbol{x}_n$ as we usually visualize in unsupervised tasks. For example, on a linear trend none of $(x, y)=\{(1, 2)$, $(5, 10)$, $(10, 20)\}$ are outliers, but $(1, 10)$ is: when the model predicts $f_{\theta}(1)=2$, it leaves a large difference of $10 - 2 = 8$.
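
As an illustrative sketch of the update above on this data (assuming a toy one-parameter model $f_{\theta}(x) = \theta x$ with $\theta = 2$, which fits the three regular points exactly), we can compare the per-point update contributions under squared loss and Huber loss:

```python
import numpy as np

# Toy one-parameter model f_theta(x) = theta * x, so df/dtheta = x.
x = np.array([1.0, 5.0, 10.0, 1.0])
y = np.array([2.0, 10.0, 20.0, 10.0])   # last point (1, 10) is the outlier
theta, delta = 2.0, 1.0

residual = y - theta * x                               # [0, 0, 0, 8]

# Per-point contribution to the update, squared loss: (y - f) * df/dtheta
contrib_squared = residual * x
# Per-point contribution, Huber loss: residual clipped at +/- delta
contrib_huber = np.clip(residual, -delta, delta) * x

print(contrib_squared)  # [0. 0. 0. 8.]  -> the outlier dominates the update
print(contrib_huber)    # [0. 0. 0. 1.]  -> its influence is capped at delta
```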

When to use each of them?

Keeping in mind that we are only talking about one-dimensional targets, Huber loss is a complete replacement for squared loss when dealing with outliers. However, the challenge is the choice of $\delta$, which makes it a less favorable "first choice" when we are not yet familiar with the problem at hand. Therefore, we may start with squared loss (or other losses), and later experiment with Huber loss for different values of $\delta$.
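
As a practical sketch of such an experiment, one could sweep a few threshold values and compare them on held-out data, for example with scikit-learn's HuberRegressor (note that its `epsilon` parameter is applied to residuals rescaled by an estimated noise level, so it is not numerically identical to the $\delta$ above, but the tuning idea is the same; the data here is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.5, size=200)
y[:10] += 20.0                          # corrupt a few targets to create outliers

X_train, y_train = X[:150], y[:150]     # the outliers land in the training set
X_val, y_val = X[150:], y[150:]

# Sweep a few values of the threshold (scikit-learn calls it `epsilon`, must be > 1)
for eps in [1.1, 1.35, 2.0, 5.0]:
    model = HuberRegressor(epsilon=eps).fit(X_train, y_train)
    err = mean_absolute_error(y_val, model.predict(X_val))
    print(f"epsilon={eps:4.2f}  validation MAE={err:.3f}")
```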
