In general, can you say anything about how well the probabilities returned by XGBoost are calibrated? Is it true that, because XGBoost directly optimizes log-loss, its probabilities are generally well-calibrated?
-
XGBoost produces likelihoods, not probabilities, does it not? – user78229 Jun 19 '23 at 11:56
2 Answers
No, they are not well-calibrated. The predicted probabilities are likely not outright horrible, as we would expect from an SVM classifier, but they are usually not very well-calibrated. For that matter, the estimated probability deciles are not even guaranteed to be monotonic. In Caruana et al. (2004) "Ensemble Selection from Libraries of Models", boosted trees have some of the "worst" calibration performance scores. Similarly, in Niculescu-Mizil & Caruana (2005) "Predicting good probabilities with supervised learning", boosted trees have "the predicted values massed in the center of the histograms, causing a sigmoidal shape in the reliability plots". An important caveat is that these results most likely refer to AdaBoost behaviour and are not validated against XGBoost (which was published about a decade later); thank you to @seanv507 for raising this point. That said, empirically, newer GBM implementations (XGBoost, LightGBM, etc.) are still prone to a "sigmoidal shape" in their results, as model training rewards "over-confident" predictions (which is why there is still interest in well-calibrated probabilities). Finally, it is worth pointing out that these findings do not even touch upon the scenarios of up-sampling, down-sampling or re-weighting our data; in those cases, it is very unlikely that our predicted probabilities have a direct interpretation at all.
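A quick way to check this behaviour on one's own data is to draw a reliability plot. The snippet below is only a minimal sketch: model, X_test and y_test are placeholder names for an already-fitted binary XGBClassifier and a held-out test set.

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

# Raw predicted probabilities for the positive class (placeholder names).
p_test = model.predict_proba(X_test)[:, 1]
# Empirical fraction of positives vs. mean predicted probability per bin.
frac_pos, mean_pred = calibration_curve(y_test, p_test, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="XGBoost (raw)")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("mean predicted probability")
plt.ylabel("fraction of positives")
plt.legend()
plt.show()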
Do note that "badly" calibrated probabilities are not synonymous with a useless model, but I would urge performing an extra calibration step (e.g. Platt scaling, isotonic regression or beta calibration) if using the raw probabilities is of importance. Similarly, looking at Guo et al. (2017) "On Calibration of Modern Neural Networks" can be helpful, as it provides a range of different metrics (Expected Calibration Error (ECE), Maximum Calibration Error (MCE), etc.) that can be used to quantify calibration discrepancies. That paper also touches upon the issue of "over-confidence" (and proposes its "own" calibration step, temperature scaling).
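As a concrete illustration of such a calibration step, here is a minimal sketch using isotonic regression together with a simple binned ECE estimate; it is not the exact procedure of any of the papers above, and model, X_cal, y_cal, X_test and y_test are placeholder names (the calibration set should be disjoint from both the training and the test data).

import numpy as np
from sklearn.isotonic import IsotonicRegression

def expected_calibration_error(y_true, y_prob, n_bins=10):
    # Binned ECE: weighted average gap between the mean predicted probability
    # and the empirical fraction of positives in each bin.
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(y_prob, bins[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

# Fit the calibration map on the held-out calibration set (placeholder names) ...
p_cal = model.predict_proba(X_cal)[:, 1]
iso = IsotonicRegression(out_of_bounds="clip").fit(p_cal, y_cal)

# ... and compare ECE on the test set before and after calibration.
p_test = model.predict_proba(X_test)[:, 1]
print("ECE raw:       ", expected_calibration_error(y_test, p_test))
print("ECE calibrated:", expected_calibration_error(y_test, iso.predict(p_test)))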
-
It's not clear, but I believe those early calibration papers are not referring to gradient boosted trees optimised by logloss but rather to AdaBoost-type methods. They bundle boosted trees and SVMs together as 'max margin' methods, so I don't believe those papers are relevant. – seanv507 Jun 19 '23 at 07:20
-
(Not your point, but) the 2017 nnet paper refers to the fact that nnets are overtrained by logloss because that achieves higher classification accuracy. – seanv507 Jun 19 '23 at 07:28
-
Thank you for these comments. You are correct: by boosting the authors most likely refer to AdaBoost indeed. That said, empirically "sigmoidal shapes" are still common in GBM results. (I was a bit uncomfortable about the "max margin" statement too.) In that sense, if we accept that GBMs can be similarly "overtrained" on logloss as NNets, we are still in the same boat. But thanks, it is definitely a reasonable caveat to mention; I will edit the answer. Thanks again for both of your comments. – usεr11852 Jun 19 '23 at 11:04
-
I appreciate that you caveat what you say by noting that these benchmarking exercises don't include XGBoost, and what I'm saying is largely covered by the comments made by yourself and seanv507. But the fact that XGBoost is well known to win many Kaggle competitions which are judged on logloss, together with personal experience of XGBoost more often than not being the model which performs best under cross-validation using logloss, suggests that its predictions tend to be pretty well-calibrated, at least compared to other models one might use. – gazza89 Jun 19 '23 at 13:19
-
BTW, you guys got me thinking, so I tried the tutorial from the betacal package but with an XGBClassifier instead of an AdaBoostClassifier (betacal implements the beta calibration method mentioned in the main answer). The raw probability estimates from (default parameters) XGBoost give empirical deciles that are as uncalibrated as the AdaBoost estimates in their example (based on ECE as well as visual inspection). – usεr11852 Jun 19 '23 at 17:35
-
Could you provide the exact code? My point about NNs was that stopped training didn't use logloss as the metric, but classification accuracy. If one uses logloss to stop training, you would get better probability calibration (but worse classification accuracy). – seanv507 Jun 20 '23 at 09:05
-
So my question is: are the default parameters specifying log loss for training, and are you using stopped training with logloss? – seanv507 Jun 20 '23 at 09:23
-
Just run the tutorial but replace AdaBoostClassifier(n_estimators=200) with XGBClassifier(n_estimators=200); they have some extra plotting/util functions here and there, so it is a bit awkward. Notice they use their own AdaBoost implementation too. By default, XGBClassifier uses "binary:logistic" as its objective. – usεr11852 Jun 20 '23 at 09:54
-
I have added my optimised code as an answer (don't know how to mention with special chars) – seanv507 Jun 28 '23 at 11:47
XGBoost is well calibrated provided you optimise for log_loss (as the objective and in the hyperparameter search).
ML models tend to "default" to overfitting (as opposed to e.g. logistic regression, where you default to using just the linear terms, not all possible interactions and power terms, etc.).
Below I took an example from the betacal package (as suggested by @usεr11852) and used XGBoost either with default parameters or with a search for the best hyperparameters (using the cal dataset for calibration and hyperparameter search, and the test set for out-of-sample evaluation). I then compare the log loss and calibration curves of the default vs the log-loss-optimised hyperparameters.
Whilst calibration helps the default model, it has minimal or even negative effect on logloss for the optimised model. (I treat log_loss also as a metric for calibration.)

I should note that I'm slightly dubious about the dataset: whilst the test logloss is ~0.15, it can sometimes be as low as 0.08 (just based on the data split).
#%%
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss
import scipy.stats as stats
from betacal import BetaCalibration
import plotnine as p9
def log_loss_s(target, pred):
    """Per-sample log loss summary: mean, standard deviation and standard error."""
    ll = -np.where(target == 1, np.log(pred), np.log(1 - pred))
    ll_m = ll.mean()
    ll_s = ll.std()
    ll_sem = ll_s / np.sqrt(len(ll))  # standard error of the mean = std / sqrt(n)
    return {"mean": ll_m, "std": ll_s, "sem": ll_sem}
def random_search(results, params_base, params_dist, n_iter, best_model=None):
    """Random hyperparameter search; appends to results in place and returns the best fitted model."""
    if results:
        # Recover the best score so far from previous search rounds.
        results_s = list(sorted(results, key=lambda x: x["best_score"]))
        best_score = results_s[0]["best_score"]
        assert best_score == best_model.best_score
    else:
        best_score = None
    for i in range(n_iter):
        # Sample one hyperparameter configuration from the distributions.
        params = params_base.copy()
        for key, dist in params_dist.items():
            params[key] = dist.rvs()
        xgb = XGBClassifier(**params)
        xgb = xgb.fit(x_train, y_train, eval_set=eval_set, verbose=0)
        if best_score is None or xgb.best_score < best_score:
            best_model = xgb
            best_score = xgb.best_score
        result = {"params": params, "best_score": xgb.best_score, "best_iteration": xgb.best_iteration}
        print(i, best_score, result)
        results.append(result)
    return best_model
#%%
data = np.genfromtxt('spambase.data', delimiter=',')
target = data[:,-1]
data = data[:,0:-1]
#%%
np.random.seed(42) # for train /cal/test split
#%%
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=0.5, stratify=target)
x_cal, x_test, y_cal, y_test = train_test_split(x_test, y_test, test_size=0.5, stratify=y_test)
eval_set=[(x_train,y_train),(x_cal,y_cal)]
#%%
xgb_def = XGBClassifier(n_estimators=200,random_state=23)
xgb_def = xgb_def.fit(x_train, y_train,eval_set=eval_set,verbose=0)
probas_def = xgb_def.predict_proba(x_test)
ll_def = log_loss(y_test,probas_def)
cal_probas_def = xgb_def.predict_proba(x_cal)[:, 1]
ll_cal_cal_def = log_loss(y_cal,cal_probas_def)
bc_def = BetaCalibration(parameters="abm")
bc_def.fit(cal_probas_def.reshape(-1, 1), y_cal)
probas_cal_def = bc_def.predict(probas_def[:,1])
ll_cal_def=log_loss(y_test, probas_cal_def)
print(f"test log loss default : {ll_def:0.4f}, default calibrated: {ll_cal_def:0.4f}, cal log loss default {ll_cal_cal_def:0.4f}")
#%% [markdown]
# Search parameters
#%%
# initialise array once, then repeat random search for different parameters
results=[]
xgb_opt=None
#%%
# Define the hyperparameter distributions
params_dist = {
'max_depth': stats.randint(3, 10),
'learning_rate': stats.uniform(0.01, 0.3),
'subsample': stats.uniform(0.5, 0.5),
"min_child_weight": stats.uniform(0.5,10),
"gamma": stats.uniform(0,10),
# "random_state": stats.randint(1, 100),
}
params_base={
"n_estimators":2000,
"early_stopping_rounds": 10,
"random_state":23
}
#%%
n_iter = 20
xgb_opt = random_search(results, params_base, params_dist, n_iter, xgb_opt)
results_s = list(sorted(results,key = lambda x: x["best_score"]))
best_params=results_s[0]["params"]
best_score = results_s[0]["best_score"]
#%% [markdown]
# Load parameter search results
#%%
results_df=pd.read_parquet("results.parquet")
results=results_df.to_dict(orient="records")
results_s = list(sorted(results,key = lambda x: x["best_score"]))
best_params=results_s[0]["params"]
best_score = results_s[0]["best_score"]
#%%
best_params = {
    'early_stopping_rounds': 10,
    'gamma': None,
    'learning_rate': 0.29621411104032447,
    'max_depth': 5,
    'min_child_weight': None,
    'n_estimators': 2000,
    'random_state': 21,
    'subsample': 0.8138324091853927}
#%%
xgb_opt = XGBClassifier(**best_params) # warning depends on random seed too
xgb_opt = xgb_opt.fit(x_train, y_train,eval_set=eval_set,verbose=0)
#%% [markdown]
# Evaluate best model
#%%
print(f"best_params {best_params}")
probas_opt = xgb_opt.predict_proba(x_test)
ll_opt = log_loss(y_test,probas_opt)
cal_probas_opt = xgb_opt.predict_proba(x_cal)[:, 1]
# Fit three-parameter beta calibration
bc_opt = BetaCalibration(parameters="abm")
bc_opt.fit(cal_probas_opt.reshape(-1, 1), y_cal)
probas_cal_opt = bc_opt.predict(probas_opt[:,1])
ll_cal_opt=log_loss(y_test, probas_cal_opt)
print(f"test log loss optimised: {ll_opt}, optimised calibrated: {ll_cal_opt}")
#%%
df_plot=pd.DataFrame({
"ll optim": probas_opt[:,1],
"ll optim calibrated": probas_cal_opt,
"default": probas_def[:,1],
"default calibrated": probas_cal_def,
"actual": y_test}).melt("actual",value_name="predicted")
(
p9.ggplot(df_plot, p9.aes(x="predicted",y="actual",color="variable",fill="variable")) +
p9.geom_smooth(span=0.3,method="loess") +
p9.geom_abline() +
p9.ggtitle(f"calibration plot\nlog loss ll-optimised: {ll_opt:0.4f} ll-optimised calibrated: {ll_cal_opt:0.4f}\ndefault {ll_def:0.4f} default calibrated: {ll_cal_def:0.4f}")
)
#%%
-
+1. Good work! So, as I am reading this, this has to do more with underfitting than anything else. (Because, as mentioned, log_loss is the default loss objective anyway.) (I think I will have to re-write parts of my answer in the next few days.) – usεr11852 Jun 28 '23 at 12:03
-
I would call it overfitting - e.g. stopped training suggests only 10 trees (not 200) and max depth 5 (instead of the default 6). My claim is that ML models tend to be more overparametrised and so are capable of perfectly classifying the training data. This is in line with the Guo et al. paper: modern neural nets are much more overparametrised than old neural nets, giving better performance on classification metrics but worse on probability metrics (e.g. logloss). – seanv507 Jun 28 '23 at 12:27
-
(+1) Yeah, I was actually thinking that mentioning this connection with Guo in your answer is worthwhile, because people might miss it. (I don't know if you are in academia or not, but this could be a decent little empirical paper.) – usεr11852 Jun 28 '23 at 16:35