Suppose a hospital wants to use a statistical classification model to predict what kind of surgery will be required based on some measured covariates (e.g. height, weight, age, blood pressure, etc.). Suppose there are 5 types of surgeries, listed in order of increasing severity (i.e. the classes are ordinal):
- Local Anesthesia (same day release)
- Full Anesthesia (same day release)
- Overnight Stay
- 48 Hours Monitoring
- Long Term Monitoring
Suppose the researchers have access to historical data and decide to fit a multi-class classification model (e.g. a random forest) to this data - however, they are now interested in studying the misclassification rates of this model. In particular, they want to know: "how wrong is the model when it makes a mistake?"
For example:
- Case 1: The patient actually required a Local Anesthesia surgery, and the model predicted Overnight Stay
vs.
- Case 2: The patient actually required a Local Anesthesia surgery, and the model predicted Long Term Monitoring
Even though the prediction was incorrect in both cases, the prediction in Case 1 was closer to the truth than in Case 2: in Case 1 the model was off by 2 levels, whereas in Case 2 the model was off by 4 levels.
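To make the "levels off" idea concrete, here is a minimal sketch in Python - the labels and encoding are hypothetical, chosen only to reproduce the two cases above:

```python
import numpy as np

# Integer-encode the five surgery types in severity order:
# 0 = Local Anesthesia, 1 = Full Anesthesia, 2 = Overnight Stay,
# 3 = 48 Hours Monitoring, 4 = Long Term Monitoring.
y_true = np.array([0, 0, 2, 3, 1])  # hypothetical true classes
y_pred = np.array([2, 4, 2, 1, 1])  # hypothetical model predictions

# Absolute ordinal distance per prediction: the first entry
# reproduces Case 1 (off by 2), the second Case 2 (off by 4).
levels_off = np.abs(y_true - y_pred)
print(levels_off)         # [2 4 0 2 0]
print(levels_off.mean())  # average number of levels the model is off by
```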
My Question: Although it would be relatively straightforward to build a variant of a confusion matrix that shows how severe each misclassification is, given the model's predictions (a sketch of what I mean is below) - are there any common metrics that can be used to study this? Is this a common modelling practice?
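For completeness, here is the kind of confusion-matrix variant I have in mind: the usual count matrix, weighted element-wise by the number of levels between the true and predicted class. This is only a sketch; the data and encoding are hypothetical and carried over from the snippet above:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 2, 3, 1])  # same hypothetical data as above
y_pred = np.array([2, 4, 2, 1, 1])

# Standard 5x5 count matrix (rows = true class, columns = predicted class).
cm = confusion_matrix(y_true, y_pred, labels=np.arange(5))

# Weight matrix: W[i, j] = |i - j|, the number of levels between classes i and j.
idx = np.arange(5)
W = np.abs(idx[:, None] - idx[None, :])

severity = cm * W                 # each cell now carries its "how wrong" cost
print(severity)
print(severity.sum() / cm.sum())  # average severity per prediction
```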
Thanks!