I have a classification problem with $K$ labels. I represent the correct label of an observation $x$ as a one-hot vector $y \in R^K$, with entries $y_{k'} = \delta_{kk'}$ when $x$ belongs to class $k$.
Given an observation $x$, I predict its label with a vector $f(x) \in R^K$, where the components $f_k(x)$ satisfy $f_k(x) \in (0,1)$ and $\sum_k f_k(x) = 1$. A larger value of some $f_k(x)$ means that $x$ is more likely to belong to class $k$.
I want to learn the best function $f$ by choosing an appropriate loss function $\ell$. I know that a common choice is the cross-entropy: $\ell(x,y) = -\sum_k y_k \log f_k(x)$. Is the squared loss $\ell(x,y) = \frac{1}{2}\|y - f(x)\|_2^2$ ever used? If so, does it tend to produce classifiers with noticeably different performance profiles compared to cross-entropy?
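For concreteness, here is a small numpy sketch of both losses on a one-hot target (the function and variable names are my own). One difference in shape is already visible at this level: the cross-entropy is unbounded for confidently wrong predictions, while the squared loss on a probability vector is bounded.

```python
import numpy as np

def cross_entropy(y, f):
    # ell(x, y) = -sum_k y_k log f_k(x)
    return -np.sum(y * np.log(f))

def squared_loss(y, f):
    # ell(x, y) = (1/2) ||y - f(x)||_2^2
    return 0.5 * np.sum((y - f) ** 2)

# K = 3 classes; the true class is the first one, one-hot encoded.
y = np.array([1.0, 0.0, 0.0])

# A confident, correct prediction versus a confidently wrong one.
f_right = np.array([0.90, 0.05, 0.05])
f_wrong = np.array([0.001, 0.998, 0.001])

print(cross_entropy(y, f_right), squared_loss(y, f_right))
# Cross-entropy blows up (about 6.9 here); squared loss stays below 1.
print(cross_entropy(y, f_wrong), squared_loss(y, f_wrong))
```

Both losses agree on the ranking of these two predictions, but they clearly penalize the mistake on very different scales, which is part of what I'm asking about.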
A comment on a related question warns against using the squared loss.
