
Guo et al. (ICML 2017) state the following:

During training, after the model is able to correctly classify (almost) all training samples, NLL can be further minimized by increasing the confidence of predictions. Increased model capacity will lower training NLL, and thus the model will be more (over)confident on average.

("NLL" refers to the binomial or multinomial negative log-likelihood, binary and categorical crossentropy loss, respectively, in some circles.)

I am struggling to understand this. If the categories are easy to distinguish on the available features, as evidenced by high classification accuracy (at some threshold), then the predicted probabilities should be high. With this in mind, shouldn't the "overconfident" predictions be justifiably confident?
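
To make the quoted mechanism concrete, here is a minimal NumPy sketch (toy logits and illustrative names only, not from the paper): once every sample's argmax is already correct, simply scaling up the logits, a crude stand-in for "increasing the confidence of predictions", leaves accuracy at 100% while the NLL keeps falling.

```python
import numpy as np

# Toy logits for 5 samples and 3 classes; every row's argmax already
# matches its label, i.e. training accuracy is 100%.
logits = np.array([[2.0, 0.5, 0.1],
                   [0.3, 1.8, 0.2],
                   [0.1, 0.4, 2.2],
                   [1.5, 0.2, 0.3],
                   [0.2, 2.1, 0.4]])
labels = np.array([0, 1, 2, 0, 1])

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(z, y):
    p = softmax(z)
    return -np.mean(np.log(p[np.arange(len(y)), y]))

# Scaling the logits never changes the argmax, so accuracy is fixed,
# but the NLL keeps shrinking toward 0 as the predictions sharpen.
for scale in [1.0, 2.0, 5.0, 10.0]:
    z = scale * logits
    acc = (softmax(z).argmax(axis=1) == labels).mean()
    print(f"scale={scale:5.1f}  accuracy={acc:.2f}  NLL={nll(z, labels):.4f}")
```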

REFERENCES

Guo, Chuan, et al. "On calibration of modern neural networks." International Conference on Machine Learning. PMLR, 2017.

Dave

1 Answer


Just because a model is highly accurate doesn't guarantee a high level of confidence. Take, for example, the predicted class probabilities for the $i$th object in a 3-class problem from the softmax function, such as $\hat{\pi}_i=(0.3, 0.6, 0.1)$, the kind of diffuse prediction that goes with high misclassification rates (low confidence). We know the predicted class membership is class 2; however, look at how imprecise the prediction is. Now consider the target prediction $\hat{\pi}_i=(0.01, 0.98, 0.01)$, the kind that goes with low misclassification rates (high confidence). Classification accuracy is the same in both cases, since the final class prediction is based on which probability is the maximum, not on the value of that probability. However, the paper you cited looks at the actual log-probabilities of the predictions, so for certain classifiers everything can fall apart for accuracy.

Note also that some classifiers don't use probabilities at all and instead use closest distance to a cluster, as in RBF networks, kernel regression, or support vector machines. Random forests assign the majority class label of the test objects that end up in a daughter (tree) node, which is very different from either distance or probability.
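
To make this concrete, here is a minimal NumPy sketch (variable names are just illustrative) comparing the two predictions above: both are correct for accuracy purposes, but their log-loss contributions differ by a factor of roughly 25.

```python
import numpy as np

# The two example predictions from above; the true class is class 2 (index 1).
p_diffuse = np.array([0.30, 0.60, 0.10])  # imprecise, low-confidence
p_sharp   = np.array([0.01, 0.98, 0.01])  # precise, high-confidence
true_idx = 1

for name, p in [("diffuse", p_diffuse), ("sharp", p_sharp)]:
    correct = p.argmax() == true_idx   # accuracy: identical for both
    nll_term = -np.log(p[true_idx])    # log-loss contribution: very different
    print(f"{name}: correct={correct}, NLL contribution={nll_term:.3f}")

# diffuse: correct=True, NLL contribution=0.511
# sharp:   correct=True, NLL contribution=0.020
```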

Regarding confidence limits, a 90% confidence interval implies that the accuracy value lies within its lower and upper bounds 90% of the time. As you ramp up the limits, e.g. to 95%, the misclassification rates have to decrease to the point where you're preferably dealing with target predictions like $\hat{\pi}_i=(0, 1, 0)$. So for increasing confidence levels, I suspect you would find fewer qualifying predictions, making the classifier less "confident." (Apologies for not addressing "overconfidence" directly; to me, the question is how a classifier can be overconfident at all. I use ROC-AUC to compare classifiers, which bundles in sensitivity and specificity.)

Last, I don't think you'd be able to apply this approach to a wide array of classifiers, since many don't deal with prediction probabilities. When I first looked at the paper you cited, it looked a little like a boosting approach, a technique for making a "weak classifier" better. However, the approach directly exploits misclassification rates and imprecision, which causes confidence to drop independently of accuracy.

wjktrs
    "the paper you cited looks at the actual log(probabilities) of the prediction, so everything can fall apart" - I would rather say it's the other way around. The log score is a proper scoring rule. It's rather with accuracy that "everything can fall apart": https://stats.stackexchange.com/q/312780/1352 – Stephan Kolassa Jan 27 '24 at 06:45
  • I meant fall apart for accuracy (so I modified that sentence). log(prob) is better than accuracy. – wjktrs Jan 27 '24 at 16:36