
Assume a multi-class classification task where we have 5 labels: 1, 2, 3, 4, 5.
For simplicity, let's say these are movie ratings, i.e., the number of stars.

I am after a loss function which is aware of the numeric values of the labels.
Namely, for the case $y_i = 5$, if the prediction is $\hat{y}_{i} = 3$ the loss will be smaller than for the case $\hat{y}_{i} = 1$. In other words, it will penalize the classifier according to how far off it was.

Is there a neural-network-friendly loss function for a classification task like this?

Is there such a loss function in the context of tree ensembles? SVMs? Maybe something in scikit-learn?

I found something at Ordinal Categorical Classification, but I was wondering whether there are more options, for instance with a quadratic penalty or something else.

The formula in the Keras Ordinal Categorical Crossentropy Loss Function is given by:


$$ loss(\hat{y}, y) = (1 + w) CE(\hat{y}, y) $$

where $w = \frac{\left|\operatorname{class}(\hat{y}) - \operatorname{class}(y)\right|}{k - 1}$, $CE(\cdot)$ is the cross-entropy, $\hat{y}$ is the vector of predicted class probabilities, $y$ is the vector of ground-truth class probabilities, and $k$ is the number of classes.
The operation $\operatorname{class}(y)$ is basically the $\arg \max$ over the vector, which returns the index of the class.
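For concreteness, here is a minimal sketch of this loss as a custom Keras loss function. It follows the formula above rather than the exact code in the linked repository, and the function name is my own:

```python
import tensorflow as tf

def ordinal_categorical_crossentropy(y_true, y_pred):
    """(1 + w) * CE, with w = |class(y_pred) - class(y_true)| / (k - 1)."""
    k = tf.cast(tf.shape(y_true)[-1], tf.float32)
    class_true = tf.cast(tf.argmax(y_true, axis=-1), tf.float32)
    class_pred = tf.cast(tf.argmax(y_pred, axis=-1), tf.float32)
    w = tf.abs(class_pred - class_true) / (k - 1.0)
    # w is piecewise constant (argmax), so gradients flow only through the CE term.
    ce = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    return (1.0 + w) * ce

# Usage: model.compile(optimizer="adam", loss=ordinal_categorical_crossentropy)
```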

Mark
  • 277
  • I found a similar question with a similar answer: https://datascience.stackexchange.com/questions/116656/loss-for-ordered-multi-class-data-in-classification/. – Mark Dec 03 '22 at 15:48
  • This question tends to arise every now and then. I summarized the different methods for ordinal classification here, I hope you'll find it helpful. – David Harar Apr 02 '23 at 20:34

1 Answer


The simplest approach would be to posit that your scores are interval scaled and to re-cast your problem as a numerical prediction ("regression") problem instead of as a classification one. Then you can use the Mean Squared Error between your point prediction and the outcome.
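As a hedged sketch of that recast in scikit-learn (the model choice, the toy data, and the round-and-clip step back to discrete labels are all illustrative assumptions):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))             # toy features
y = np.clip(np.round(X[:, 0] + 3), 1, 5)  # toy ratings in {1, ..., 5}

# Treat the star labels as numbers and fit a regressor with squared error (MSE).
model = GradientBoostingRegressor(loss="squared_error")
model.fit(X, y)

# If a discrete label is required, round the point prediction back onto 1..5.
y_hat = np.clip(np.round(model.predict(X)), 1, 5)
```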

Alternatively, if you output a full probabilistic prediction with a predicted CDF of $(\hat{F}_1, \dots, \hat{F}_5=1)$, you can use the Continuous Ranked Probability Score (CRPS, see Gneiting & Raftery, 2007), in a discrete version: for an actual outcome $k$, the CRPS is

$$ \text{CRPS}(\hat{F},k) = \sum_{j=1}^5\big(\hat{F}_j-1(j\geq k)\big)^2. $$
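A minimal NumPy sketch of this discrete CRPS (the function name and interface are assumptions of mine):

```python
import numpy as np

def discrete_crps(probs, k):
    """CRPS for predicted class probabilities `probs` (summing to 1 over
    classes 1..5) and an actual outcome class `k` (1-based)."""
    F_hat = np.cumsum(probs)  # predicted CDF (F_1, ..., F_5 = 1)
    F_true = (np.arange(1, len(probs) + 1) >= k).astype(float)  # step CDF at k
    return np.sum((F_hat - F_true) ** 2)

# Example: a forecast concentrated near class 4, actual outcome 5
print(discrete_crps(np.array([0.05, 0.05, 0.2, 0.5, 0.2]), k=5))
```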

Stephan Kolassa
  • 123,354
  • Indeed, one of the approaches I tried was regression with a round() operation. Yet pure classification feels more appropriate here, so I wanted to see if there is something out there. By the way, Matlab, for instance, allows defining the loss function as a matrix, namely we can define a loss for any pair combination. – Mark Nov 23 '22 at 19:34
  • Is the CRPS any different from what I linked to? – Mark Nov 23 '22 at 19:36
  • I don't see a loss function at the link. Did I miss it? Also, whether a matrix for the loss makes sense is not a priori clear: if you want a point prediction, it may make sense to output a prediction of 1.2, rather than one that is restricted to the five categories. (This of course goes away if you use probabilistic predictions, which you would restrict to the categories.) – Stephan Kolassa Nov 23 '22 at 19:43
  • https://github.com/JHart96/keras_ordinal_categorical_crossentropy – Mark Nov 24 '22 at 06:10
  • Yes, that I did see. I am not seeing a definition of "Ordinal Categorical Crossentropy" there. However, in general, "ordinary" cross-entropy is a synonym for the log loss, i.e., the negative log-likelihood. The CRPS above is more akin to the Brier score. Differences arise especially if we see an outcome our model had predicted a probability of zero for. You may find this interesting, although it's for the binary case: Why is LogLoss preferred over other proper scoring rules? – Stephan Kolassa Nov 24 '22 at 08:24
  • This is what I meant in Matlab: they allow creating a loss per pair of classes, so you can make it behave according to the distance (see the sketch below). – Mark Nov 24 '22 at 11:56
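For illustration, the kind of pairwise cost matrix mentioned in that last comment could be built like this (a hypothetical sketch; the quadratic weighting is just one possible choice):

```python
import numpy as np

# Pairwise misclassification costs for 5 ordered classes: the cost grows
# quadratically with the distance between the true and predicted class.
classes = np.arange(5)
cost_matrix = (classes[:, None] - classes[None, :]) ** 2
print(cost_matrix)
# [[ 0  1  4  9 16]
#  [ 1  0  1  4  9]
#  [ 4  1  0  1  4]
#  [ 9  4  1  0  1]
#  [16  9  4  1  0]]
```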