I have found a Python function that calculates the decomposition of various proper scoring rules, such as the Brier score and the log loss. However, it does not seem to accept arrays as arguments, so if I want to work with this function, I seem to be limited to binary outcomes...
...but I want to know about more than two categories!
I can hack around this limitation, however, by looping over the categorical outcomes and their predicted probabilities.
import pandas as pd
import numpy as np
from model_diagnostics.scoring import decompose, SquaredError
np.random.seed(2023)
# Matrix of truth: columns indicate the three categories.
# Five observations, each of which can belong to one of three categories.
y_true = np.array([
[0, 1, 0],
[0, 1, 0],
[1, 0, 0],
[0, 0, 1],
[0, 0, 1]
])
# Matrix of predictions
y_pred = np.array([
[0.2, 0.5, 0.3],
[0.1, 0.8, 0.1],
[0.7, 0.1, 0.2],
[0.6, 0.1, 0.3],
[0.2, 0.3, 0.5]
])
# Try to do it with the arrays, but it fails:
# decompose(
#     y_true,
#     y_pred,
#     scoring_function=SquaredError(),
# )
# Loop over the categories to get the "decompose" information for each category
dfs = []
for i in range(3):
    d = decompose(
        y_true[:, i],
        y_pred[:, i],
        scoring_function=SquaredError(),
    )
    # Make a data frame with one row for this category
    df_now = pd.DataFrame()
    df_now["N"] = [np.sum(y_true[:, i])]
    df_now["miscalibration"] = d["miscalibration"]
    df_now["discrimination"] = d["discrimination"]
    df_now["uncertainty"] = d["uncertainty"]
    df_now["score"] = d["score"]
    # Save that data frame to a list
    dfs.append(df_now)
# Concatenate all of the one-row data frames
df = pd.concat(dfs).reset_index(drop=True)
# Averages of the four columns, weighted by the instances of each category
miscalibration = np.sum(df["miscalibration"] * df["N"]) / np.sum(df["N"])
discrimination = np.sum(df["discrimination"] * df["N"]) / np.sum(df["N"])
uncertainty = np.sum(df["uncertainty"] * df["N"]) / np.sum(df["N"])
score = np.sum(df["score"] * df["N"]) / np.sum(df["N"])
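As a partial sanity check on the total score (though not on the other components), the overall multiclass Brier score is easy to compute directly from the arrays, which at least gives me a target to compare against (this assumes the "score" column returned by decompose with SquaredError is just the plain mean of the squared errors):

# Multiclass Brier score computed directly:
# the mean over observations of the per-row sum of squared differences
brier_direct = np.mean(np.sum((y_pred - y_true) ** 2, axis=1))
# Note: by linearity, the unweighted SUM of the per-category mean squared
# errors equals brier_direct, which is not the same as the N-weighted average
print(brier_direct, score)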
When I hack around it like this and take the average of each component (miscalibration, discrimination, uncertainty, and total score), weighted by the number of instances of each category, do I get the same result for each component as if I had calculated it the "right" way with a function that accepts arrays as arguments? If not, is there some other way to obtain the decomposition for the whole data set just from the category-by-category results?
(Bonus: does the story change for other scoring rules or for multi-label problems where multiple categories can be observed?)
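For the bonus, here is a sketch of the kind of comparison I have in mind for the log loss (this assumes the package exports a LogLoss scoring function alongside SquaredError, and that its "score" column can be averaged like above):

from model_diagnostics.scoring import LogLoss
from sklearn.metrics import log_loss

# Per-category binary log losses via the same loop as above
per_category = []
for i in range(3):
    d = decompose(
        y_true[:, i],
        y_pred[:, i],
        scoring_function=LogLoss(),
    )
    per_category.append(np.mean(d["score"]))

# Compare the summed per-category losses to the direct multiclass log loss
print(np.sum(per_category), log_loss(y_true, y_pred))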
If I use scoring_function = LogLoss() and compare the score to the log_loss from sklearn.metrics, my answers disagree. Perhaps I just made a typo in my code, but I think I have made a disappointing finding. – Dave Nov 29 '23 at 21:21