Can anyone tell me what the effects of $L_2$ loss and smooth $L_1$ loss (i.e. Huber loss with $\alpha = 1$) are, and when to use each of them?
1 Answer
First, Huber loss only works in one dimension, as it requires $$\left\|\boldsymbol{a}\right\|_2=\left\|\boldsymbol{a}\right\|_1=\delta$$ at the intersection of the two pieces, which only holds in one dimension. The $L_2$ and $L_1$ norms are defined for vectors. Therefore, in my opinion, Huber loss is better compared with squared loss than with $L_2$ loss, since "$L_2$" suggests a multi-dimensional input, whereas "squared" does not.
Huber loss coincides with squared loss for differences smaller than $\delta$, and behaves like absolute loss (scaled by $\delta$ and shifted so the two pieces meet) for differences larger than $\delta$, i.e. $$\begin{align*} L_{\delta}(y_n, f_{\theta}(\boldsymbol{x}_n)) =\left\{ \begin{array}{ll} \frac{1}{2}\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right)^2, & \left|y_n - f_{\theta}(\boldsymbol{x}_n)\right| \leq \delta,\\ \delta\left|y_n - f_{\theta}(\boldsymbol{x}_n)\right| - \frac{1}{2}\delta^2, & \text{otherwise,} \end{array} \right. \end{align*}$$
where $y_n$ is the target of data point $n$, and $f_{\theta}(\boldsymbol{x}_n)$ is the model's prediction. Note that $L_{\delta}$ has nothing to do with the $L_p$ norms, despite the similar notation.
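As a quick illustration (not part of the original answer), here is a minimal NumPy sketch of this piecewise definition; the function name and the default $\delta=1$ are purely illustrative:

```python
import numpy as np

def huber_loss(y, y_pred, delta=1.0):
    """Piecewise Huber loss L_delta for scalar or array residuals."""
    r = y - y_pred                                  # residual y_n - f_theta(x_n)
    quad = 0.5 * r**2                               # squared branch, |r| <= delta
    lin = delta * np.abs(r) - 0.5 * delta**2        # absolute branch, |r| > delta
    return np.where(np.abs(r) <= delta, quad, lin)

print(huber_loss(np.array([0.5, 3.0]), 0.0))        # [0.125  2.5]
```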
Because of this definition, for large differences caused by outliers the gradient of the loss function stays bounded in magnitude, just as for absolute loss, i.e. $$\frac{\partial\, \delta\left|y_n - f_{\theta}(\boldsymbol{x}_n)\right|}{\partial \theta_i} = \pm \delta\, \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i},$$ whereas for squared loss the gradient grows with the difference, i.e. $$\frac{\partial\, \frac{1}{2}\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right)^2}{\partial \theta_i} = -\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right)\frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i},$$
which leads to large contributions from outliers when we update a parameter based solely on squared loss (here, gradient descent with learning rate $\lambda$): $$\begin{align*} {\theta}'_i &=\theta_i + \lambda \sum_n \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i}\left(y_n - f_{\theta}(\boldsymbol{x}_n)\right) \\ &= \theta_i + \lambda\sum_{n \notin \text{outliers}} \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i}(\text{small}) +\lambda\sum_{n \in \text{outliers}} \frac{\partial f_{\theta}(\boldsymbol{x}_n)}{\partial \theta_i}(\text{large}) \end{align*}$$
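To make the capped-gradient behaviour concrete, here is a small sketch under the assumption of a one-parameter linear model $f_{\theta}(x)=\theta x$ (my own illustrative choice, not from the answer), where $\partial f_{\theta}/\partial \theta = x$:

```python
import numpy as np

def grad_squared(theta, x, y):
    # d/dtheta of 0.5*(y - theta*x)^2  =  -(y - theta*x) * x
    return -(y - theta * x) * x

def grad_huber(theta, x, y, delta=1.0):
    # gradient of the Huber loss for the linear model f(x) = theta * x
    r = y - theta * x
    if abs(r) <= delta:
        return -r * x                   # quadratic region: same as squared loss
    return -delta * np.sign(r) * x      # linear region: magnitude capped at delta*|x|

theta = 0.0
print(grad_squared(theta, x=1.0, y=0.5), grad_huber(theta, x=1.0, y=0.5))  # -0.5 vs -0.5
print(grad_squared(theta, x=1.0, y=8.0), grad_huber(theta, x=1.0, y=8.0))  # -8.0 vs -1.0
```

A small residual contributes the same gradient under both losses, while a large (outlier) residual is capped under Huber.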
It is worth noting that, here, outliers are irregularities in the joint input-output space $(\boldsymbol{x}_n, y_n)$, not necessarily just in the input space $\boldsymbol{x}_n$ as we usually visualize in unsupervised tasks. For example, in a linear trend, none of $(x, y)=\{(1, 2)$, $(5, 10)$, $(10, 20)\}$ are outliers, but $(1, 10)$ is, which leads to a large difference ($10 - 2 = 8$) when the model predicts $f_{\theta}(1)=2$.
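Numerically (a sketch I am adding here, with $\delta=1$ chosen arbitrarily), the per-point loss contributions for this toy example look like:

```python
import numpy as np

# Toy linear trend y = 2x from the example, plus the outlier (1, 10).
x = np.array([1.0, 5.0, 10.0, 1.0])
y = np.array([2.0, 10.0, 20.0, 10.0])
y_pred = 2.0 * x                      # the model the clean points agree with

r = y - y_pred
delta = 1.0
squared = 0.5 * r**2
huber = np.where(np.abs(r) <= delta, 0.5 * r**2, delta * np.abs(r) - 0.5 * delta**2)

print(squared)   # [ 0.   0.   0.  32. ]  -> the outlier dominates the squared loss
print(huber)     # [ 0.   0.   0.   7.5]  -> its influence is much smaller under Huber
```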
When to use each of them?
Keeping in mind that we are only talking about one-dimensional targets, Huber loss is a complete replacement for squared loss when dealing with outliers. However, the challenge is the choice of $\delta$, which makes it a less favorable "first choice" when we are not yet familiar with the problem at hand. Therefore, we may start with squared loss (or other losses) and, once we know the problem better, experiment with Huber loss for different values of $\delta$, as sketched below.
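One way such an experiment might look, using scikit-learn's HuberRegressor on synthetic data I am making up here for illustration; note that its epsilon parameter plays a role analogous to $\delta$ (it is applied to scaled residuals and must be greater than 1), so the values below are only a rough sweep:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=100)
y[:5] += 15.0                          # inject a few outliers in the targets

# Sweep several thresholds and compare cross-validated fit quality.
for eps in [1.1, 1.35, 1.5, 2.0]:
    model = HuberRegressor(epsilon=eps)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"epsilon={eps}: mean CV R^2 = {score:.3f}")
```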