
There are two models, f and g, trained on some labelled data (x, y), where y has 2 classes.

During testing they correctly predict the same unseen samples. However, the probabilities they output are different. For example, f(x1) = [0.3, 0.7] and g(x1) = [0.4, 0.6],

so f considers it more likely than g does that x1 should be labelled y=1.

But given that we choose the majority probability for prediction, they both choose y=1.

In this case, are they considered the same classifier on this test set, since the correct label is either y=1 or y=0 with 100% probability? Is there a way to compare the outputs that takes the probabilities into account?

  • Have you taken a look at their sensitivity/specificity? Have you plotted ROC curves for both? – Demetri Pananos Jan 27 '20 at 19:29
  • You're asking about [tag:scoring-rules]. Continuous scoring rules use the continuous prediction data instead of arbitrarily thresholding it. Examples include log-loss and Brier score. – Sycorax Jan 27 '20 at 19:47
  • Yes, and they are exactly the same, since the confusion matrices are the same. – Sacha Gunaratne Jan 27 '20 at 19:52
  • I guess if the threshold is changed from 50% to 65%, the confusion matrices of the two would differ. But given that it is at 50%, they should be considered the same model. – Sacha Gunaratne Jan 27 '20 at 19:56

1 Answer


This is what proper and strictly proper scoring rules do, and they tend to be preferred in statistics over measures like accuracy and $F_1$ score.

Briefly, your models are not the same. One is more confident about the observation belonging to the second class, and it should be rewarded for this confidence if the true observation is the second category; likewise, if the true observation is the first category, that model should be penalized more severely for being so overconfident.

Log loss and Brier score are two standard statistics for assessing the probability outputs of machine learning models. Below, $y_i\in\{0,1\}$ are the true observations, $\hat y_i$ are the predicted probabilities of class $1$, and $N$ is the sample size.

$$ \text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[ y_i\log(\hat y_i) + (1 - y_i)\log(1 - \hat y_i) \right]\\ \text{Brier Score} = \frac{1}{N}\sum_{i=1}^{N}\left( y_i - \hat y_i \right)^2 $$

If the true label for that $x_1$ feature vector in the original question is the second category, you will find both of these giving lower (better) values for model $f$. If the true label is the first category, you will find both of these giving lower values for model $g$.
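As a minimal sketch of that point, here is plain Python (no external libraries; the helper names are just for illustration) that scores both models on the single point $x_1$ from the question, under each possible true label:

```python
import math

def log_loss(y, p):
    """Log loss for one binary observation; p is the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def brier(y, p):
    """Brier score for one binary observation; p is the predicted P(y = 1)."""
    return (y - p) ** 2

p_f = 0.7  # model f's predicted probability that x1 belongs to class y = 1
p_g = 0.6  # model g's predicted probability that x1 belongs to class y = 1

for true_y in (1, 0):
    print(f"true y = {true_y}:")
    print(f"  f: log loss = {log_loss(true_y, p_f):.3f}, Brier = {brier(true_y, p_f):.3f}")
    print(f"  g: log loss = {log_loss(true_y, p_g):.3f}, Brier = {brier(true_y, p_g):.3f}")

# If the true label is y = 1, f scores better (lower) on both rules;
# if it is y = 0, g scores better, even though both predict y = 1 at a 0.5 threshold.
```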

But given that we choose the majority probability for prediction, they both choose y=1.

It is common to do this kind of thresholding, but doing so throws away a lot of information. First, it might be that a threshold of $0.5$ is wildly inappropriate for your task, such as if the consequences of mistaking a $0$ for a $1$ are much worse than the consequences of mistaking a $1$ for a $0$. Second, this removes any kind of "grey zone" where the best decision is not to make a decision and collect more data. Yes, a prediction of $0.51$ will be mapped to a particular categorical prediction, but I would like to know that, even if this is the likely outcome, I am on thin ice.
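To make the first point concrete, here is a small sketch with made-up misclassification costs (the cost values are hypothetical, not from the question). With unequal costs, the expected-cost-minimizing threshold on $P(y=1)$ is $\frac{c_{FP}}{c_{FP}+c_{FN}}$ rather than $0.5$, and at a threshold such as $0.65$ (as raised in the comments) the two models no longer make the same prediction for $x_1$:

```python
cost_fp = 1.0   # hypothetical cost of predicting 1 when the truth is 0
cost_fn = 5.0   # hypothetical cost of predicting 0 when the truth is 1

# Predict 1 when (1 - p) * cost_fp < p * cost_fn, i.e. p > cost_fp / (cost_fp + cost_fn)
threshold = cost_fp / (cost_fp + cost_fn)
print(f"cost-minimizing threshold: {threshold:.3f}")  # about 0.167, far from 0.5

for name, p in [("f", 0.7), ("g", 0.6)]:
    print(f"model {name}: predict y=1 at t=0.50? {p > 0.5}; at t=0.65? {p > 0.65}")
```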

Frank Harrell of Vanderbilt University has two great blog posts that get into this in more detail.

Classification vs. Prediction

Damage Caused by Classification Accuracy and Other Discontinuous Improper Accuracy Scoring Rules

Dave