I have found a Python function that calculates the decomposition of various proper scoring rules, such as the Brier score and the log loss. However, it does not seem to accept arrays as arguments, so if I want to work with this function, I seem to be limited to binary outcomes...
...but I want to know about more than two categories!
I can hack around this limitation, however, by looping over the categorical outcomes and their predicted probabilities.
import pandas as pd
import numpy as np
from model_diagnostics.scoring import decompose, SquaredError
np.random.seed(2023)
# Matrix of truth: columns indicate the three categories.
# Five observations, each of which can belong to one of three categories.
y_true = np.array([
[0, 1, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]
])
# Matrix of predictions
y_pred = np.array([
[0.2, 0.5, 0.3],
[0.1, 0.8, 0.1],
[0.7, 0.1, 0.2],
[0.6, 0.1, 0.3],
[0.2, 0.3, 0.5]
])
# Try to do it with the arrays, but it fails:
# decompose(
#     y_true,
#     y_pred,
#     scoring_function=SquaredError(),
# )
# Loop over the categories to get the "decompose" information for each category
dfs = []
for i in range(3):
    d = decompose(
        y_true[:, i],
        y_pred[:, i],
        scoring_function=SquaredError(),
    )
    # Make a data frame with one row for this category
    df_now = pd.DataFrame()
    df_now["N"] = [np.sum(y_true[:, i])]
    df_now["miscalibration"] = d["miscalibration"]
    df_now["discrimination"] = d["discrimination"]
    df_now["uncertainty"] = d["uncertainty"]
    df_now["score"] = d["score"]
    # Save that data frame to a list
    dfs.append(df_now)
# Concatenate all of the one-row data frames
df = pd.concat(dfs).reset_index(drop=True)
# Averages of the four columns, weighted by the instances of each category
miscalibration = np.sum(df["miscalibration"] * df["N"]) / np.sum(df["N"])
discrimination = np.sum(df["discrimination"] * df["N"]) / np.sum(df["N"])
uncertainty = np.sum(df["uncertainty"] * df["N"]) / np.sum(df["N"])
score = np.sum(df["score"] * df["N"]) / np.sum(df["N"])
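As a partial sanity check on the total score (though not on the other components), the overall multiclass Brier score is easy to compute directly from the arrays, which at least gives me a target to compare against (this assumes the "score" column returned by decompose with SquaredError is just the plain mean of the squared errors):

# Multiclass Brier score computed directly:
# the mean over observations of the per-row sum of squared differences
brier_direct = np.mean(np.sum((y_pred - y_true) ** 2, axis=1))
# Note: by linearity, the unweighted SUM of the per-category mean squared
# errors equals brier_direct, which is not the same as the N-weighted average
print(brier_direct, score)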
When I hack around it like this and take the average of each component (miscalibration, discrimination, uncertainty, and total score), weighted by the number of instances of each category, do I get the same result for each component as if I had calculated it the "right" way with a function that accepts arrays as arguments? If not, is there some other way to obtain the decomposition for the whole data set just from the category-by-category results?
(Bonus: does the story change for other scoring rules or for multi-label problems where multiple categories can be observed?)
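For the bonus, here is a sketch of the kind of comparison I have in mind for the log loss (this assumes the package exports a LogLoss scoring function alongside SquaredError, and that its "score" column can be averaged like above):

from model_diagnostics.scoring import LogLoss
from sklearn.metrics import log_loss

# Per-category binary log losses via the same loop as above
per_category = []
for i in range(3):
    d = decompose(
        y_true[:, i],
        y_pred[:, i],
        scoring_function=LogLoss(),
    )
    per_category.append(np.mean(d["score"]))

# Compare the summed per-category losses to the direct multiclass log loss
print(np.sum(per_category), log_loss(y_true, y_pred))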
If I use scoring_function = LogLoss() and compare the score to the log_loss from sklearn.metrics, my answers disagree. Perhaps I just made a typo in my code, but I think I have made a disappointing finding. – Dave Nov 29 '23 at 21:21