6

I know there's a lot of material on this, but I'm still struggling to find a scenario where cross-entropy loss is better than MSE loss for a multi-class classification problem.

For example, suppose the true probabilities are:

  • [1, 0, 0, 0]

and the predicted probabilities (after applying softmax) are:

  • [0.6, 0.4, 0, 0]

The cross-entropy loss is 0.74 (using log base 2), and the MSE loss is 0.08.

If we change the predicted probabilities to [0.4, 0.6, 0, 0], the cross-entropy loss is 1.32, and the MSE loss is 0.18.

As expected, the cross-entropy loss is higher in the second case because the predicted probability of the true label is lower. However, the MSE loss also captures this change by increasing.
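For reference, here is a quick numerical check of the figures above (a sketch only; it assumes cross-entropy with log base 2 and MSE averaged over the four class probabilities, which matches the numbers quoted):

```python
import numpy as np

y_true = np.array([1, 0, 0, 0])

for y_pred in (np.array([0.6, 0.4, 0.0, 0.0]),
               np.array([0.4, 0.6, 0.0, 0.0])):
    # Cross-entropy: -sum over classes of y_true * log2(y_pred); only the
    # true class contributes, so zero-probability classes are skipped to
    # avoid log(0).
    mask = y_true > 0
    ce = -np.sum(y_true[mask] * np.log2(y_pred[mask]))
    # MSE: mean squared difference across all four class probabilities.
    mse = np.mean((y_true - y_pred) ** 2)
    print(f"pred={y_pred}: cross-entropy={ce:.2f}, MSE={mse:.2f}")
```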

So my question is: why do we need cross-entropy loss? MSE loss seems to work fine. Or is it to do with the fact that the cross-entropy loss jumped by 0.58 whereas the MSE loss only increased by 0.10? I've tried lots of different examples with different values, but the MSE loss and the cross-entropy loss either both increase or both decrease (unless there's an example I haven't tried yet).

I know that cross-entropy loss only cares about the probability of the true label and aims to maximise it. But measuring the distance between all the probabilities indiscriminately (as MSE does) indirectly penalises a low probability on the true label anyway, so I don't see the point of using cross-entropy loss.

fx-85
  • 91
  • I'm not sure I follow. Are you saying that MSE is better because the numerical value of the MSE loss is smaller than the value of the cross-entropy loss? These numerical values aren't directly comparable because they are not on the same scale. MSE is in square units and cross-entropy is in log units. – Sycorax May 04 '22 at 01:03
  • @Sycorax What I mean is, both MSE and cross-entropy losses increase when the prediction probability of the true class is reduced. Therefore, I don't understand why we need cross-entropy loss, when MSE does the job already. I assume there's probably an extra feature of cross-entropy I'm overlooking, or a certain situation where it does something that MSE can't do. – fx-85 May 04 '22 at 01:11

1 Answer

4

Minimizing cross-entropy loss is equivalent to maximum likelihood estimation in a multinomial logistic regression. Consequently, we get all of the wonderful features of maximum likelihood estimation.
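To sketch why (my notation, not from the original answer): write $y_{ik}$ for the one-hot label and $\hat p_{ik}$ for the predicted (softmax) probability of class $k$ for observation $i$. The multinomial likelihood and its negative log are

$$
\mathcal{L} = \prod_{i=1}^{N}\prod_{k=1}^{K} \hat p_{ik}^{\,y_{ik}},
\qquad
-\log \mathcal{L} = -\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\log \hat p_{ik},
$$

and the right-hand side is $N$ times the average cross-entropy loss, so minimizing cross-entropy is exactly maximizing the likelihood.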

This topic has come up here before, and one of the more passionate responses calls it “silly” to estimate the parameters by minimizing square loss.

That said, MSE, which goes by the name “Brier score” in this setting, is a strictly proper scoring rule and can be a performance metric worth calculating.
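For illustration, a minimal sketch (with hypothetical labels and probabilities) of computing both metrics on a set of predictions, using scikit-learn's `log_loss` for the cross-entropy and writing the multiclass Brier score out by hand:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([0, 2, 1, 0])                 # integer class labels
p_pred = np.array([[0.7, 0.2, 0.1],             # predicted class probabilities
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3],
                   [0.4, 0.4, 0.2]])

# Average cross-entropy (natural log) over the four observations.
ce = log_loss(y_true, p_pred, labels=[0, 1, 2])

# Multiclass Brier score: squared difference between the one-hot labels and
# the predicted probabilities, summed over classes and averaged over rows.
onehot = np.eye(3)[y_true]
brier = np.mean(np.sum((onehot - p_pred) ** 2, axis=1))

print(f"log loss = {ce:.3f}, Brier score = {brier:.3f}")
```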

Dave
  • 62,186