
This question is related to probability calibration and the Brier score.

I have faced the following issue. I have a Random Forest binary classifier, and I then apply isotonic regression to calibrate its probabilities. The result is the following:

[Plot: reliability diagram comparing original and calibrated probabilities, with both Brier scores shown in the title]

The question: why is the Brier score of the calibrated probabilities a bit worse than that of the non-calibrated probabilities? What could the problem be?

Here is the Python code:

from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import brier_score_loss
import matplotlib.pyplot as plt

def calibrate_probas(clf, X_train, y_train, X_test, y_test, weights_test, cv):
    # Uncalibrated probabilities from the already-fitted base classifier
    probas = clf.predict_proba(X_test)[:, 1]

    # Calibrate with isotonic regression via cross-validation
    calibrator = CalibratedClassifierCV(clf, cv=cv, method='isotonic')
    calibrator.fit(X_train, y_train)
    calibrated_probas = calibrator.predict_proba(X_test)[:, 1]

    # Brier scores are computed WITH sample weights
    clf_score = brier_score_loss(y_test, probas, pos_label=y_test.max(),
                                 sample_weight=weights_test)
    clf_score_c = brier_score_loss(y_test, calibrated_probas, pos_label=y_test.max(),
                                   sample_weight=weights_test)

    # Calibration curves are computed WITHOUT sample weights
    fop_c, mpv_c = calibration_curve(y_test, calibrated_probas, n_bins=10, normalize=True)
    fop, mpv = calibration_curve(y_test, probas, n_bins=10, normalize=True)

    # Plot the perfectly calibrated diagonal
    f, ax1 = plt.subplots(figsize=(16, 6))  # was subplots(1, 1) unpacked into two axes
    ax1.plot([0, 1], [0, 1], linestyle='--')

    # Plot model reliability
    ax1.plot(mpv_c, fop_c, marker='.', label='Calibrated')
    ax1.plot(mpv, fop, marker='.', c='g', label='Original')
    ax1.legend()

    title = f'Brier score / Brier score calib: {clf_score} / {clf_score_c}'
    ax1.set_title(title)

    plt.show()

Unfortunately, I cannot provide the data; one reason is that the files are too big. As you can see, I am not doing anything special here, just standard Python functions. Where could the error be?

ABK
  • Have you decomposed the Brier score & looked at the components? (a sketch of the decomposition follows these comments) – gung - Reinstate Monica Nov 17 '20 at 14:13
  • Dear @gung-ReinstateMonica, could you please explain what you mean by "decomposing the Brier score"? – ABK Nov 17 '20 at 14:15
  • OK, from what I can see, there is an uncertainty/reliability/resolution trade-off. Correct? – ABK Nov 17 '20 at 14:20
  • Still, from the plot it is absolutely unclear why the Brier score behaves like this. – ABK Nov 17 '20 at 14:21
  • Does this still happen if you increase the number of bins? Since you're binning by the predicted probability, the two classifiers needn't have the same observations in the same bins, so they may not be directly comparable. – Accidental Statistician Nov 17 '20 at 14:46
  • You are calculating the Brier score with weights and the graph without. What are the weights? – seanv507 Nov 17 '20 at 16:02
  • I think @seanv507 got it. You do not use the sample weights for the calibration, but you use them to compute the two Brier scores. That calibration is not the one that would minimize the Brier score (as you calculate it). – Jacques Wainer Nov 18 '20 at 01:59
  • Dear @seanv507, in our case we use the weights to "give more weight" to important samples when fitting the base classifier (clf in the code above). – ABK Nov 18 '20 at 10:13
  • Dear @AccidentalStatistician, regarding your question "Does this still happen if you increase the number of bins?": yes, it does. – ABK Nov 18 '20 at 10:15
  • OK, but the mismatch between your calibration curves and your Brier scores is likely due to this. What happens if you generate the Brier score without weights? (I don't know whether the calibration curve supports sample weights.) – seanv507 Nov 18 '20 at 11:57
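For reference, here is a minimal sketch of the decomposition gung mentions (the Murphy decomposition: Brier score = reliability − resolution + uncertainty). brier_decomposition is a hypothetical helper that bins by predicted probability, not a library function, and it assumes NumPy arrays of 0/1 labels and probabilities in [0, 1]:

import numpy as np

def brier_decomposition(y_true, y_prob, n_bins=10):
    # Murphy decomposition: Brier score = reliability - resolution + uncertainty
    bins = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    base_rate = y_true.mean()
    reliability = 0.0
    resolution = 0.0
    for k in range(n_bins):
        mask = bins == k
        if not mask.any():
            continue
        o_k = y_true[mask].mean()  # observed event frequency in bin k
        p_k = y_prob[mask].mean()  # mean predicted probability in bin k
        reliability += mask.sum() * (p_k - o_k) ** 2
        resolution += mask.sum() * (o_k - base_rate) ** 2
    n = len(y_true)
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty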

1 Answer


I have faced the following issue. I have a Random Forest binary classifier, and I then apply isotonic regression to calibrate its probabilities.

The question: why is the Brier score of the calibrated probabilities a bit worse than that of the non-calibrated probabilities? What could the problem be?

Whilst the graph shows that the calibrated classifier performs significantly better, the Brier score loss is actually slightly better for the uncalibrated model.

Looking at the code, one sees that the Brier loss is calculated with sample weights, whilst the calibration and the calibration curve are calculated without sample weights.

This explains the apparent mismatch in performance between the calibration curve and the Brier score.
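One way to check this is to make the two computations consistent: either pass the weights to the calibration fit as well, or compute the Brier scores without weights so they match the (unweighted) calibration curves. A minimal sketch, assuming a weights_train array analogous to weights_test and a scikit-learn version whose CalibratedClassifierCV.fit accepts sample_weight:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

# Fit the calibrator with the same kind of weights used in the evaluation.
# weights_train is an assumed array of per-sample weights for X_train.
calibrator = CalibratedClassifierCV(clf, cv=cv, method='isotonic')
calibrator.fit(X_train, y_train, sample_weight=weights_train)
calibrated_probas = calibrator.predict_proba(X_test)[:, 1]

# Compare like with like: both scores weighted...
clf_score_c = brier_score_loss(y_test, calibrated_probas,
                               sample_weight=weights_test)
# ...or both unweighted, to match the calibration curves.
clf_score_c_unweighted = brier_score_loss(y_test, calibrated_probas)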

As an aside, I would suggest not using an additional calibration step that needs its own cross-validation, but instead searching for RF parameters that give good probabilities; see https://arxiv.org/pdf/1812.05792.pdf
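For illustration, here is a minimal sketch of that alternative, using scikit-learn's built-in 'neg_brier_score' scorer (available from version 0.22) so the forest's hyperparameters are selected directly for probability quality; the parameter grid is made up for the example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid: leaf size and tree depth strongly affect how well
# a random forest's vote fractions behave as probabilities.
param_grid = {
    'min_samples_leaf': [1, 5, 20, 50],
    'max_depth': [None, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=500, random_state=0),
    param_grid,
    scoring='neg_brier_score',  # higher (less negative) is better
    cv=5,
)
search.fit(X_train, y_train)
probas = search.best_estimator_.predict_proba(X_test)[:, 1]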

seanv507