I know there's a lot of material on this, but I'm still struggling to find a scenario where cross-entropy loss is better than MSE loss for a multi-class classification problem.
For example, suppose the true probabilities are:
- [1, 0, 0, 0]
and the predicted probabilities (after a softmax) are:
- [0.6, 0.4, 0, 0]
The cross-entropy loss is 0.74 (using log base 2) and the MSE loss is 0.08.
If we change the predicted probabilities to [0.4, 0.6, 0, 0], the cross-entropy loss is 1.32 and the MSE loss is 0.18.
As expected, the cross-entropy loss is higher in the second case because the predicted probability for the true label is lower. However, the MSE loss also captures this change by increasing.
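In case the arithmetic is useful to check, here is a minimal sketch that reproduces these numbers (I'm assuming log base 2 for the cross-entropy, which is what 0.74 and 1.32 correspond to; the function names are just mine for illustration):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """-sum(t * log2(p)); clip so the zero entries don't produce log(0)."""
    p = np.clip(y_pred, eps, 1.0)
    return -np.sum(np.asarray(y_true) * np.log2(p))

def mse(y_true, y_pred):
    """Mean squared error over all class probabilities."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

y_true = np.array([1.0, 0.0, 0.0, 0.0])
for y_pred in ([0.6, 0.4, 0.0, 0.0], [0.4, 0.6, 0.0, 0.0]):
    y_pred = np.array(y_pred)
    print(y_pred, "CE:", round(cross_entropy(y_true, y_pred), 2),
          "MSE:", round(mse(y_true, y_pred), 2))
```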
So my question is: why do we need cross-entropy loss? MSE loss seems to work fine. Or is it to do with the fact that the cross-entropy loss increased by 0.58 whereas the MSE loss only increased by 0.10? I've tried lots of different examples with different values, but the MSE loss and cross-entropy loss either both increase or both decrease (unless there's an example I haven't tried yet).
I know that cross-entropy loss only cares about the predicted probability of the true label, and aims to maximise it. But measuring the distance between all the probabilities (as MSE does) indirectly accounts for the probability of the true label anyway, so I don't see the point of using cross-entropy loss.
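To make that comparison concrete, this is how I understand the two losses for a one-hot target with true class $c$ and $K$ classes (just the standard definitions, with the cross-entropy in whatever log base you pick):

```latex
\begin{align}
  \mathrm{CE}(y, \hat{p})  &= -\sum_{k=1}^{K} y_k \log \hat{p}_k \;=\; -\log \hat{p}_c, \\
  \mathrm{MSE}(y, \hat{p}) &= \frac{1}{K} \sum_{k=1}^{K} (y_k - \hat{p}_k)^2
                            \;=\; \frac{1}{K}\Big[(1 - \hat{p}_c)^2 + \sum_{k \neq c} \hat{p}_k^2\Big].
\end{align}
```

So cross-entropy depends only on $\hat{p}_c$, while MSE also penalises how the remaining mass is spread across the wrong classes, which is exactly the distinction I'm asking about.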