4

I have trained a model that predicts a multinomial output (a.k.a. multi-class classification). Would anyone know how the accuracy can be measured?

The target takes one of the following values: "Yellow", "Red", "Green", or "Blue".

I know that for binomial targets, ROC/AUC provide a good solution. But I haven't found anything for my problem.

Faz
  • 41
  • 3
    Presumably if the target is 'green' then 'red' and 'blue' and 'yellow' are considered equally inaccurate (you don't specify). If that's the case, then you're effectively just in a binomial situation - in each case, either it got in the right category or it didn't. – Glen_b Jan 23 '14 at 00:49
  • Thanks a lot for the reply. Yes, in my case any misclassification is equally inaccurate. However, in the case of a binomial output, we use the raw propensities (probability of a positive outcome) to construct the ROC curve and hence calculate the AUC. In a multinomial case like this one, I do not have raw propensities. So the best thing that I can think of is using the rate of misclassifications to assess a model's accuracy. But if anyone knows a stronger method (like AUCs for binomial), please do let me know. – Faz Jan 24 '14 at 16:45
  • 1
    Can you define 'raw propensity' for me (google was no help)? You may find searching for things like 'multiclass ROC' or 'multiclass AUC' here (and more widely) gets some hits, such as this answer – Glen_b Jan 24 '14 at 23:25

1 Answer

1

As you've perhaps noticed, evaluating the area under the ROC curve has an advantage over evaluating "classification" accuracy because it considers the raw outputs (or at least their ranks) instead of the outputs that have been run through a decision rule. However, arguably even better than evaluating the area under the ROC curve and the output ranks is to evaluate the probability predictions themselves. Harrell, for instance, describes the "gold standard" as evaluating the (log) likelihood. In the binary case, this is the binomial likelihood.
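To make the contrast concrete, here is a minimal sketch (with made-up binary outcomes and predicted probabilities) showing how thresholded accuracy discards the probability information that the likelihood-based evaluation, computed here via sklearn's `log_loss`, retains.

```python
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

y_true = np.array([0, 1, 1, 0, 1])            # observed binary outcomes (made up)
p_hat = np.array([0.2, 0.9, 0.6, 0.4, 0.7])   # predicted probabilities of class 1 (made up)

# Accuracy only sees the hard labels after thresholding at 0.5
print(accuracy_score(y_true, (p_hat >= 0.5).astype(int)))

# log_loss is the mean negative binomial log likelihood, evaluated on the probabilities
print(log_loss(y_true, p_hat))
```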

For the multi-class case, the likelihood is multinomial. This is related to the categorical crossentropy loss function that you will see in many machine learning packages. In some circles, this is called "negative log likelihood" because of this relationship to multinomial likelihood (though note the comments in the link about this being somewhat of an abuse of terminology).
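As an illustration (the predicted probabilities below are invented for the four-colour target in the question), the categorical crossentropy can be computed with sklearn's `log_loss`:

```python
import numpy as np
from sklearn.metrics import log_loss

# Classes in alphabetical order, matching the column order log_loss expects
classes = ["Blue", "Green", "Red", "Yellow"]
y_true = ["Red", "Green", "Red", "Blue", "Yellow"]

# One row of made-up predicted probabilities per observation
p_hat = np.array([
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.70, 0.15, 0.10],
    [0.30, 0.30, 0.30, 0.10],
    [0.80, 0.05, 0.05, 0.10],
    [0.10, 0.10, 0.10, 0.70],
])

# Mean of -log(predicted probability of the observed class)
print(log_loss(y_true, p_hat, labels=classes))
```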

The drawback of calculating categorical crossentropy loss is that it is hard to interpret. Consequently, it may help to normalize it the way that we normalize mean squared error to obtain $R^2$ in classical linear regression. The direct analogue of McFadden's pseudo $R^2$ discussed by UCLA here is to divide by the categorical crossentropy achieved by a model that always predicts the relative class frequencies. That is, if there are $5000$ images of dogs, $5000$ images of cats, and $10000$ images of deer, you would predict $(0.25, 0.25, 0.50)$ every time to calculate the normalizing factor.

$$ R^2_{pseudo} = 1 - \dfrac{ \text{ Categorical crossentropy loss from your model } }{ \text{ Categorical crossentropy loss from always predicting the relative class frequencies } } $$

This gives a value that might be easier to interpret than the crossentropy itself. Values above $0$ indicate stronger performance (in terms of the "gold standard" measure of performance) than the baseline. Values below $0$ indicate performance worse than the baseline and, in some sense, a worthless model. Values very close to $1$ indicate near-perfect performance. This idea can even be applied to out-of-sample testing in order to quell concerns about overfitting.
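As a sketch of the calculation (reusing the invented probabilities from above), the baseline predicts the relative class frequencies for every observation, and the pseudo $R^2$ compares the two crossentropy values:

```python
import numpy as np
from sklearn.metrics import log_loss

classes = ["Blue", "Green", "Red", "Yellow"]
y_true = ["Red", "Green", "Red", "Blue", "Yellow"]

# Made-up predicted probabilities from "your model"
p_model = np.array([
    [0.10, 0.20, 0.60, 0.10],
    [0.05, 0.70, 0.15, 0.10],
    [0.30, 0.30, 0.30, 0.10],
    [0.80, 0.05, 0.05, 0.10],
    [0.10, 0.10, 0.10, 0.70],
])

# Baseline: every observation gets the relative class frequencies as its prediction
freqs = np.array([np.mean([y == c for y in y_true]) for c in classes])
p_baseline = np.tile(freqs, (len(y_true), 1))

loss_model = log_loss(y_true, p_model, labels=classes)
loss_baseline = log_loss(y_true, p_baseline, labels=classes)

print(1 - loss_model / loss_baseline)  # pseudo R^2
```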

The idea of comparing to a naïve baseline appears throughout statistics. The classical $R^2$ does it. The UCLA page above gives the McFadden pseudo $R^2$ that does it for comparing binary crossentropy loss values. Here, I discuss this idea in the context of quantile regression, and sklearn has an implementation of what they call the "$D^2$ pinball score" that compares pinball loss scores. Finally, here, a number of us discuss this in the context of defining unpredictability. It is quite natural to do the same for categorical crossentropy loss.

Dave
  • 62,186