
I have already referred to this post, so please don't mark this as a duplicate.

I am working on a binary classification problem using algorithms such as random forest, extra trees, and logistic regression. The dataset shape is (977, 6) and the class ratio is 77:23.

In terms of our metric of interest, F1, random forest did best, followed by extra trees, with logistic regression last.

However, in terms of calibration, I see that logistic regression is well calibrated (not surprising), followed by extra trees, with random forest last.

But my question is: why does logistic regression have a higher Brier score loss than random forest (which, unlike logistic regression, is not inherently calibrated)?

Shouldn't logistic regression have the smallest Brier score loss, followed by extra trees, with random forest last?

Please find the graphs below.

[Three plots from the original post showing the calibration results for the models]
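For context, here is a minimal sketch of how this comparison can be set up with scikit-learn. The data are synthetic (generated to match the described shape and class ratio), so the numbers it prints are illustrative only, not the results shown in the plots above:

```python
# Minimal, hypothetical sketch (not the original code): synthetic data with the
# same shape (977, 6) and roughly the same 77:23 class ratio, to show how the
# F1 / Brier / calibration-curve comparison can be reproduced with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, brier_score_loss
from sklearn.calibration import calibration_curve

X, y = make_classification(n_samples=977, n_features=6,
                           weights=[0.77, 0.23], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "extra trees": ExtraTreesClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]        # predicted P(class = 1)
    print(f"{name}: F1 = {f1_score(y_test, model.predict(X_test)):.3f}, "
          f"Brier = {brier_score_loss(y_test, proba):.3f}")
    # points for the reliability diagram: observed frequency vs. mean prediction
    frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
```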

The Great
  • Can you add a histogram of predicted probabilities? I think the following might be the case: logistic regression is well calibrated, but perhaps it is not willing to go out on a limb in extreme cases, whereas the random forest is. This would reduce the Brier score for RF (see the first sketch after these comments). – Demetri Pananos Apr 02 '22 at 18:24
  • Additionally, you should compute the loss for each prediction for each model and compare to see which observations are driving the increase in Brier score. – Demetri Pananos Apr 02 '22 at 18:27
  • @Demetri These metrics are computed on the test data, and we know from the other questions about this dataset/problem that there is overfitting. So are the results surprising? – dipetkov Apr 02 '22 at 18:34
  • @dipetkov - Do you say it is overfitting based on the random forest calibration curve for the upper estimates? Can you let me know why you think it is overfitting? I am trying to learn, because based on the confusion matrix, I felt the performance between train and test was comparable. – The Great Apr 02 '22 at 18:45
  • These are your own words: "My problem is whatever I do, I see that my model overfits" taken from this post. – dipetkov Apr 02 '22 at 18:55
  • Yes, I agree. Can I have your opinion on the confusion matrix results? Do you think it is overfitting? Can you help me with that? – The Great Apr 02 '22 at 18:56
  • My advice is to do more labeling. It's the single most effective thing you can do. – dipetkov Apr 02 '22 at 19:00
  • So, you think the model is overfitting? I know the overfitting criterion is subjective... but do my results, while poor, show any red flags? – The Great Apr 02 '22 at 19:02
  • @DemetriPananos - I was going through your tutorial on bootstrap optimism using sklearn https://dpananos.github.io/posts/2021/11/blog-post-34/ - Can you let me know what the split function does? Sorry, I am new to Python. Is the split function automatically invoked when we call the OptimismBootstrap class? In the code below for linear regression, I don't see explicit calls to the split function, so I am trying to understand how this works. This question is mainly due to my limited Python experience (which I am still learning). – The Great Apr 03 '22 at 03:36
  • @TheGreat When you pass OptimismBootstrap as an object to cross_validate, the split method is called in order to actually get the cross-validation folds (a simplified illustration follows these comments). It would behoove you to take a peek at sklearn.model_selection.KFold's documentation and source code, as I borrowed a lot from there (also, my code is not production worthy; it is only for demonstration purposes). – Demetri Pananos Apr 03 '22 at 04:07
  • @DemetriPananos - Thanks for your help and code. I have updated my results here: https://stats.stackexchange.com/questions/570172/bootstrap-optimism-corrected-results-interpretation . It would really be useful to have your opinion/comments on this. – The Great Apr 03 '22 at 04:26
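Following up on the first two comments, a hedged sketch of both diagnostics. The names `proba_lr`, `proba_rf`, and `y_test` are hypothetical stand-ins for each model's test-set probabilities and the test labels; this is not the poster's actual code.

```python
# Diagnostics suggested in the comments (hypothetical variables: proba_lr and
# proba_rf hold test-set probabilities, y_test holds the 0/1 test labels).
import numpy as np
import matplotlib.pyplot as plt

# 1) Histogram of predicted probabilities: does the random forest "go out on a
#    limb" (push predictions toward 0 and 1) more than logistic regression does?
plt.hist(proba_lr, bins=20, alpha=0.5, label="logistic regression")
plt.hist(proba_rf, bins=20, alpha=0.5, label="random forest")
plt.xlabel("predicted probability of the positive class")
plt.ylabel("count")
plt.legend()
plt.show()

# 2) Per-observation squared errors (the terms averaged into the Brier score):
#    which observations drive the gap between the two models?
loss_lr = (proba_lr - y_test) ** 2
loss_rf = (proba_rf - y_test) ** 2
worst = np.argsort(loss_lr - loss_rf)[-10:]   # where logistic regression loses most
print(loss_lr[worst], loss_rf[worst])
```

And a simplified illustration of the splitter mechanism mentioned in the last few comments: `cross_validate` calls the splitter's `split` method internally, which is why there is no explicit call in the linear-regression example. `BootstrapSplitter` below is a hypothetical stand-in, not the `OptimismBootstrap` class from the blog post.

```python
# Any object exposing split() and get_n_splits() can be passed as cv=...;
# cross_validate invokes split() itself to obtain the (train, test) index pairs.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

class BootstrapSplitter:
    def __init__(self, n_splits=10, random_state=0):
        self.n_splits = n_splits
        self.random_state = random_state

    def split(self, X, y=None, groups=None):
        rng = np.random.default_rng(self.random_state)
        n = len(X)
        for _ in range(self.n_splits):
            yield rng.integers(0, n, size=n), np.arange(n)  # bootstrap "train", full-sample "test"

    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)
scores = cross_validate(LinearRegression(), X, y, cv=BootstrapSplitter())
```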

1 Answer


Brier score can be decomposed into measures of calibration and discrimination. Calibration describes the extent to which predicted probabilities align with true event occurrence. That is, if an event that is predicted to happen with probability $0.5$ actually happens $90\%$ of the time, the calibration is poor. Discrimination describes the extent to which model predictions for the two categories can be separated, and the Brier score does well here when the predicted distributions for the two categories are easy to separate (hence the relationship to the ROC AUC discussed in the link).
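For reference, the standard Murphy decomposition over $K$ probability bins makes the two components explicit (this is the textbook form, not something computed from the plots above):

$$\text{BS} \;=\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k - \bar o_k)^2}_{\text{reliability (calibration)}} \;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar o_k - \bar o)^2}_{\text{resolution (discrimination)}} \;+\; \underbrace{\bar o\,(1 - \bar o)}_{\text{uncertainty}},$$

where $n_k$ forecasts fall in bin $k$ with mean forecast $f_k$, $\bar o_k$ is the observed event frequency in that bin, and $\bar o$ is the overall event rate. A well-calibrated model keeps the reliability term small, but it only achieves a low Brier score if the resolution term is also large.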

You have a poor Brier score despite good calibration. This must mean that the model's ability to discriminate between the two categories is poor.
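A minimal numeric sketch of that decomposition, assuming `y_true` and `y_prob` hold the test labels and one model's predicted probabilities (hypothetical names, not the variables from the question):

```python
# Decompose the Brier score into reliability (calibration), resolution
# (discrimination) and uncertainty using equal-width probability bins.
import numpy as np

def brier_decomposition(y_true, y_prob, n_bins=10):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    n = len(y_true)
    base_rate = y_true.mean()

    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    reliability = 0.0
    resolution = 0.0
    for k in range(n_bins):
        mask = bins == k
        n_k = mask.sum()
        if n_k == 0:
            continue
        f_k = y_prob[mask].mean()    # mean forecast in the bin
        o_k = y_true[mask].mean()    # observed event frequency in the bin
        reliability += n_k * (f_k - o_k) ** 2
        resolution += n_k * (o_k - base_rate) ** 2

    reliability /= n
    resolution /= n
    uncertainty = base_rate * (1 - base_rate)
    # With binned forecasts the identity is approximate:
    # Brier score ≈ reliability - resolution + uncertainty
    return reliability, resolution, uncertainty
```

If logistic regression's reliability term is small but its resolution term is also small (predictions staying close to the roughly 23% base rate), its overall Brier score can still exceed the random forest's, which is consistent with the comment about the forest being more willing to go out on a limb.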

Dave
  • Is a Brier score of 20 considered poor? Is there a range for a good Brier score? – The Great Apr 25 '23 at 04:48
  • @TheGreat https://stats.stackexchange.com/questions/414349/is-my-model-any-good-based-on-the-diagnostic-metric-r2-auc-accuracy-rmse/414350#414350 – Dave Apr 25 '23 at 09:35