I'm working on an online model scoring framework, and my goal is to understand whether my model's predictive performance is degrading week over week. I have a classification model (trained on a binary target) that produces probabilities, and I am most interested in assessing the accuracy of those probabilities. For model selection and tuning, I'm using log loss, but I understand that log loss can't be used to compare performance across different datasets (different weeks of predictions). Is there a best-practice metric that I can use to effectively compare the accuracy of predicted probabilities across different datasets? More explicitly, what are appropriate metrics for monitoring predicted-probability quality over time?
1 Answer
I agree with your statement that comparing log losses $L(p_i) = - \log (p_i)$ across different datasets is not very meaningful if these datasets exhibit different statistical behaviour. See also this post on 'good' log loss values for more background.
Looking at differences
I would suggest modifying the metric you are already using so that its interpretation becomes (more) meaningful across datasets. The easiest way would be to switch to loss differences instead of raw losses. If $\bar L$ is the average loss of your model and $\bar L_{base}$ is the average loss of a simple baseline forecast (e.g. the frequency of success over the last $n$ weeks), then $\bar L_{base} - \bar L$ should be positive most of the time. If you start to see negative values for several weeks, this suggests that your model is no longer useful. Keep in mind that you would have to check whether the observed sign changes are significant or just bad luck for your model.
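A minimal sketch of this weekly check in Python (not part of the original answer; the simulated data and the use of the week's own base rate as a stand-in for the historical frequency of success are assumptions made purely for illustration):

```python
import numpy as np
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# One simulated week of data: true event probabilities, observed binary
# outcomes, and the model's (noisy but informative) predicted probabilities
p_true = rng.uniform(0.1, 0.9, size=500)
y_week = rng.binomial(1, p_true)
p_model = np.clip(p_true + rng.normal(0, 0.05, size=500), 0.01, 0.99)

# Baseline forecast: a constant probability; here the week's own base rate
# stands in for the frequency of success over the last n weeks
p_base = np.full_like(p_model, y_week.mean())

L_model = log_loss(y_week, p_model)   # average log loss of the model
L_base = log_loss(y_week, p_base)     # average log loss of the baseline
print(f"baseline {L_base:.3f}, model {L_model:.3f}, difference {L_base - L_model:.3f}")
```

A positive difference means the model beats the baseline that week; a run of negative weeks would be the warning sign described above.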
Forecast Skill
To refine this idea, you could also scale the average loss difference by the average loss of the baseline model (if it is positive, which it should be for the log loss), i.e. via $$ \bar L_{skill} := \frac{\bar L_{base} - \bar L}{\bar L_{base}} = 1 - \frac{ \bar L}{\bar L_{base}} $$ Then $\bar L_{skill}$ is bounded above by 1 (if your model were perfect) and values below 0 again indicate that your model is no longer useful. Additionally, it quantifies how big the loss difference is compared to the baseline loss and thus gives you a useful interpretation across datasets.
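As a further illustrative sketch (again not from the original answer; the `skill_score` helper and the simulated inputs are assumptions for the example), the skill score is just one extra line on top of the two average losses:

```python
import numpy as np

def skill_score(y, p_model, p_base, eps=1e-15):
    """1 - (average log loss of model) / (average log loss of baseline)."""
    p_model = np.clip(p_model, eps, 1 - eps)
    p_base = np.clip(p_base, eps, 1 - eps)
    ll_model = -np.mean(y * np.log(p_model) + (1 - y) * np.log(1 - p_model))
    ll_base = -np.mean(y * np.log(p_base) + (1 - y) * np.log(1 - p_base))
    return 1.0 - ll_model / ll_base

# Simulated week: an informative model vs. a base-rate baseline
rng = np.random.default_rng(1)
p_true = rng.uniform(0.1, 0.9, size=500)
y = rng.binomial(1, p_true)
p_model = np.clip(p_true + rng.normal(0, 0.05, size=500), 0.01, 0.99)
p_base = np.full_like(p_model, y.mean())

# Positive values: the model beats the baseline; values below 0 suggest
# the model no longer adds anything over the base rate
print(skill_score(y, p_model, p_base))
```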
For proper scoring rules (of which the log loss is an example) this quantity is also called a skill score; see, for instance, this paper. In Section 2.3 the authors also state that
If scores [meaning losses in this discussion] for distinct sets of situations are compared, then considerable care must be exercised to separate the confounding effects of intrinsic predictability and predictive performance
which summarizes why it is probably ill-advised to compare losses across different datasets.
Connection to coefficients of determination such as $R^2$
As noted by Dave in the comments, forecast skill relates to McFadden's pseudo $R^2$ and may agree with it exactly, depending on the baseline model. Additionally, if $L$ is the squared error loss and the baseline model is a pure intercept model (i.e. just the mean of the observations), then the in-sample skill of a model agrees with the usual $R^2$ coefficient of determination from ordinary least squares regression. The same connection arises when $L$ is the absolute loss, giving the coefficient of determination $R^1$ that is sometimes used in quantile regression.
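A small numerical check of that last claim (a sketch with made-up regression data; scikit-learn's `LinearRegression` and `r2_score` are used only for convenience):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
y = 1.5 * X[:, 0] + rng.normal(scale=0.7, size=200)

pred = LinearRegression().fit(X, y).predict(X)

mse_model = np.mean((y - pred) ** 2)      # average squared-error loss of the model
mse_base = np.mean((y - y.mean()) ** 2)   # loss of the intercept-only (mean) baseline
skill = 1 - mse_model / mse_base

print(skill, r2_score(y, pred))  # the two numbers coincide in-sample
```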
- Note that this skill score is related to McFadden's pseudo $R^2$ and may be equal to it, depending on the baseline model. – Dave Aug 14 '23 at 18:43
- Thank you so much! I have a follow-up question: would it be meaningful to use something like the Brier score or absolute loss (or maybe even log loss) to measure model performance over time (not as a direct comparison)? For example, if absolute loss is increasing week over week, can you deduce after a few weeks that there is some underlying drift between the data you're predicting on and the data your model was trained on, and that the model therefore needs to be retrained? – Ted Aug 15 '23 at 19:38
- @Ted I suppose you can do this, but why use a score or loss for this job? If you are interested in whether your new data are similar to your training data, then comparing simple means or quantiles, or doing something like the Kolmogorov–Smirnov test, seems more straightforward. – picky_porpoise Aug 15 '23 at 19:56
- I will be using those types of metrics to assess drift, but I guess I'm still looking for something to assess "if the model is performing well". – Ted Aug 15 '23 at 20:04
- @Ted Well, how to do this is what my answer is trying to illustrate. Can you explain what else you want to do? – picky_porpoise Aug 15 '23 at 20:11
- @picky_porpoise My understanding (unless I am missing something) is that your method would give me the interpretation "my model is performing (better or worse) compared to a simpler model/approach". It cannot be interpreted as "the model is performing objectively well". – Ted Aug 15 '23 at 20:17
- @Ted I don't see these two statements as incompatible as your comment makes it sound. Comparing a model to some simple baseline (e.g. via the skill) gives you a statement about the model's performance on a scale from $-\infty$ to 1. I doubt you can find a metric that makes it any more 'objective'. – picky_porpoise Aug 20 '23 at 19:48
- @Ted To see this from a different angle, consider the coefficient of determination $R^2$ in OLS regression (see also the new last part of my answer). It compares the regression model to the simple baseline mean model, in the same manner as the forecast skill does. However, no one is arguing that $R^2$ does not yield an interpretation like "the model is performing objectively well/poorly". – picky_porpoise Aug 20 '23 at 19:48
- @picky_porpoise The double negative in the last sentence makes it quite difficult to understand what you mean. Do you mean that everyone sees the usual $R^2$ as having an objective interpretation on the well/poorly spectrum? // @Ted It's really hard to make a context-independent judgment about model quality. – Dave Sep 02 '23 at 16:38
- @Dave I was trying to say that I think most people see $R^2$ as a kind of objective measure of model quality, and few criticize it for "only" comparing to a reference model. And this is OK, since such a critique is not very solid in my opinion. – picky_porpoise Sep 02 '23 at 18:04
- "but I understand that log loss can't be used to compare performance across different datasets" Welcome to Cross Validated! Could you please expand on this? – Dave Aug 10 '23 at 19:01