
Say I have a neural network that classifies images by training to minimise cross-entropy loss against one-hot encoded training labels. Such networks are often 'overconfident': the softmax of the final layer puts almost all of its mass on one class even when that class is wrong. However, the ultimate concern is usually accuracy, so why is this seen as such a problem? If I understand correctly, the softmax of the final layer is just some numbers and has no meaningful probabilistic interpretation anyway?

There must be more to the story here, since techniques like label smoothing that encourage better calibration also often lead to better test accuracy, but I don't really understand why that should happen.
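For concreteness, here is a minimal sketch of the label smoothing I mean (NumPy; the smoothing factor `eps = 0.1` is an arbitrary choice, not anything prescribed):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Mix the one-hot target with the uniform distribution over classes:
    the correct class gets 1 - eps + eps/K, every other class gets eps/K."""
    n_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / n_classes

y = np.array([0.0, 0.0, 1.0])  # one-hot target for class 2 of 3
print(smooth_labels(y))        # [0.0333... 0.0333... 0.9333...]
```

Training against these softened targets penalises the network for pushing the softmax all the way to a corner of the simplex, which is where the connection to calibration comes in.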

  • The output of softmax is a probability vector, so it most certainly does have a probabilistic interpretation! – Sycorax Aug 14 '23 at 13:02
  • @Sycorax Right, but it doesn't really make a meaningful statement about probabilities given the data – Danny Duberstein Aug 14 '23 at 13:17
  • "However the ultimate concern is usually accuracy": accuracy has major limitations for both balanced and imbalanced problems. Please see this and this and the Frank Harrell blog posts described within: (1) (2). – Dave Aug 14 '23 at 13:20
  • @Dave Thanks for the reference. I guess what I said was imprecise; my point was more that a classifier in use would typically just output one classification, i.e. the argmax of the predicted probability vector. – Danny Duberstein Aug 14 '23 at 13:40
  • @DannyDuberstein That’s exactly what those links argue should not be done. Predicted probabilities matter. – Dave Aug 14 '23 at 13:43
  • @DannyDuberstein The probabilities are meaningful if they match the true probabilities, i.e. if they are calibrated. And calibration can be improved by label smoothing! – Sycorax Aug 14 '23 at 14:44
  • Fair play to all – Danny Duberstein Aug 15 '23 at 20:40
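To make the calibration point in these comments concrete, here is a minimal sketch of a binned calibration check (NumPy; the bin count and the toy inputs are assumptions, not anything from the thread). A calibrated model has mean confidence roughly equal to observed accuracy within each bin; an overconfident one shows confidence consistently above accuracy:

```python
import numpy as np

def calibration_table(confidences, correct, n_bins=10):
    """Group predictions by confidence and compare each group's
    mean confidence with its observed accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi)
        if mask.any():
            print(f"[{lo:.1f}, {hi:.1f}): "
                  f"confidence {confidences[mask].mean():.2f}, "
                  f"accuracy {correct[mask].mean():.2f}")

# Toy inputs: max-softmax confidence per prediction and whether the
# argmax was correct (both would come from a held-out test set).
conf = np.array([0.95, 0.97, 0.99, 0.96, 0.98])
hit  = np.array([1, 0, 1, 0, 1])
calibration_table(conf, hit)  # confidence ~0.97 vs accuracy 0.60
```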

1 Answer


A high-stakes example makes the problem clear.

Say you are predicting the probability of death versus survival. Your model predicts that, if you proceed with a particular course of action, say skydiving, the probability of survival is $0.9999$, so a one-in-ten-thousand chance of dying. That makes skydiving sound rather safe and worth the small risk for the extreme thrill. However, if the chance of dying is actually one-in-three, the model has misled the skydiver into thinking this risky activity is low-risk. The skydiver might be willing to jump if the risk of death is one-in-ten-thousand but not if it is one-in-three. That is, the model being overconfident about the probability of survival has led the skydiver to make the wrong decision, even though both the predicted and true probabilities indicate the same outcome (survival) is more likely than the alternative (death).
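A toy expected-utility calculation makes the decision flip explicit. The utility numbers below are arbitrary illustrative assumptions; only the probabilities come from the example:

```python
# Utility of jumping under a given probability of death.
# The payoff and penalty values are made-up illustrative numbers.
THRILL_VALUE  = 200        # utility of a successful jump
COST_OF_DEATH = 1_000_000  # disutility of dying

def expected_utility(p_death):
    return THRILL_VALUE * (1 - p_death) - COST_OF_DEATH * p_death

print(expected_utility(1e-4))  # ~ +100: the jump looks worth it
print(expected_utility(1/3))   # ~ -333,200: clearly not worth it
```

Under the overconfident prediction the expected utility is positive, so the skydiver jumps; under the true probability it is hugely negative, even though both probabilities agree that survival is the more likely outcome.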

Dave