
The usual method for adapting binary classifiers such as SVMs to multiclass data is one-vs-all, which assumes that the labels are independent and that, in the case of a prediction error, we don't care which label the incorrect prediction outputs.

But suppose I want to predict a score from 1 to 5, and I prefer to err closer to the truth; how do I go about it then? $\lt k$ vs $\ge k$ comes to mind, because in this case labels that are closer together will have larger training-data overlap, but is it theoretically sound? Specifically, is overall performance likely to suffer? More generally, what are the state-of-the-art techniques I could employ in this situation?

1 Answer


This question seems to come up every now and then (see here, for example), so I'll just summarize the answers and sources that have accumulated since this question was asked.

Redefine Objective

  1. Ordinal Categorical Classification
    A modified version of the cross-entropy, adjusted to an ordinal target. It penalizes the model more for predicting a wrong category that is far from the true category than for predicting a wrong category that is close to it.

$$l(y,\hat{y}) = (1+\omega) \cdot CE(y,\hat{y}), \text{ s.t.} $$ $$\omega = \dfrac{|class(\hat{y})-class(y)|}{k-1}$$

Where $k$ is the number of possible classes and $class(\cdot)$ returns the predicted class of an observation, obtained by arg-maxing the probabilities in a multi-class prediction task. A Keras implementation of the ordinal cross-entropy is available here. An example (taken from the link):

import ordinal_categorical_crossentropy as OCC

model = ...  # your model here
model.compile(loss=OCC.loss, optimizer='adam', metrics=['accuracy'])
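
For intuition, here is a minimal NumPy sketch of the loss defined above. It is an illustrative re-implementation, not the linked Keras code, and the function name and one-hot input format are my own assumptions:

import numpy as np

def ordinal_crossentropy(y_true, y_pred, eps=1e-12):
    """y_true: one-hot targets; y_pred: predicted probabilities; both of shape (n, k)."""
    k = y_true.shape[1]
    ce = -np.sum(y_true * np.log(y_pred + eps), axis=1)               # standard cross-entropy per sample
    dist = np.abs(np.argmax(y_pred, axis=1) - np.argmax(y_true, axis=1))
    omega = dist / (k - 1)                                            # normalized distance between classes
    return np.mean((1.0 + omega) * ce)                                # weight CE by (1 + omega)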

  2. Treat your problem as a regression problem
    Instead of a classification task, treat your task as a regression task, and then round your predictions (or map them to categories) using any kind of method. As Stephan Kolassa mentions, the underlying assumption of this method is that one's scores are interval scaled.
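
A minimal sketch of this approach, assuming integer scores from 1 to 5 and scikit-learn; the choice of regressor, the toy data, and the simple rounding rule are illustrative, not part of the original suggestion:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: X_train/X_test are features, y_train holds integer scores 1..5.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(100, 4)), rng.normal(size=(10, 4))
y_train = rng.integers(1, 6, size=100)

reg = RandomForestRegressor(random_state=0).fit(X_train, y_train)
raw = reg.predict(X_test)                        # continuous predictions
pred = np.clip(np.rint(raw), 1, 5).astype(int)   # round and clamp back to the 1..5 label set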

  3. Cumulative Link Loss
    Originally proposed here. It is based on the logistic regression model, but with a link function that maps the logits to the cumulative probabilities of being in a given category or lower. A set of ordered thresholds splits this space into the different classes of the problem. The projections are estimated by the model.
    Under this method, two loss functions are suggested, probit-based and logit-based. A Keras implementation of the two versions is available here.
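
To illustrate the idea, here is a hand-rolled sketch with a logit link; it is not the linked Keras implementation, and the function name and threshold handling are assumptions:

import numpy as np

def cumulative_link_probs(z, thresholds):
    """z: latent scores, shape (n,); thresholds: increasing cut points, shape (k-1,)."""
    cdf = 1.0 / (1.0 + np.exp(-(thresholds[None, :] - z[:, None])))   # P(y <= j), logit link
    cdf = np.concatenate([cdf, np.ones((len(z), 1))], axis=1)         # P(y <= k-1) = 1
    return np.diff(cdf, axis=1, prepend=0.0)                          # P(y = j) as successive differences

# e.g. cumulative_link_probs(np.array([-1.0, 0.5]), np.array([-2.0, -0.5, 0.5, 2.0]))

Training then minimizes the negative log-likelihood of the true class under these probabilities.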

  4. Other methods I won't cover
    The following table, taken from here, describes other loss functions for ordinal categorical targets:
    On the Consistency of Ordinal Regression Methods

  5. Treat ordinal categories as a multi-label problem
    Under this method, we convert our ordinal targets into a binary matrix in which, for class $c$, the first $c$ entries of the corresponding row are 1 and the rest are 0, as the following suggests:

Lowest  -> [1,0,0,0,0]
Low     -> [1,1,0,0,0]
Medium  -> [1,1,1,0,0]
High    -> [1,1,1,1,0]
Highest -> [1,1,1,1,1]

Then the loss function has to be changed, as this paper suggests. See this blog post for an implementation.
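
A minimal sketch of this encoding and a simple decoding rule, following the table above; the exact loss from the paper is not reproduced, and the function names and the 0.5 threshold are illustrative assumptions:

import numpy as np

def encode(y, k=5):
    """Map integer class c in 0..k-1 to a cumulative 0/1 vector, e.g. 2 -> [1, 1, 1, 0, 0]."""
    return (np.arange(k)[None, :] <= np.asarray(y)[:, None]).astype(float)

def decode(probs, threshold=0.5):
    """Predicted class = number of per-output sigmoids above the threshold, minus one."""
    return np.maximum((np.asarray(probs) > threshold).sum(axis=1) - 1, 0)

# encode([0, 2, 4]) -> [[1,0,0,0,0], [1,1,1,0,0], [1,1,1,1,1]]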