
I have a model that uses XGBoost for binary classification. However, on certain datasets I am getting probability outputs that are quite unrealistic: probabilities close to 100% that, from domain knowledge, I know for a fact cannot be right. Is this often a sign of overfitting? I've plotted training and testing accuracy and they do not diverge much, which leads me to believe overfitting is not the most likely issue. In addition, my confusion-matrix-based metrics look adequate (accuracy: 73%, F1: 75%, recall: 75%).

I've used GridSearchCV to optimize hyperparameters, and have calibrated the model using CalibratedClassifierCV with logistic regression.
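
For reference, my setup looks roughly like this (a minimal sketch; the parameter grid and the `X_train`/`y_train` names here are illustrative, not my exact code):

    # Rough sketch of the setup (parameter grid and variable names are illustrative)
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.calibration import CalibratedClassifierCV

    param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}
    grid = GridSearchCV(XGBClassifier(), param_grid, cv=5)
    grid.fit(X_train, y_train)

    # "sigmoid" fits a logistic regression to the raw scores (Platt scaling)
    calibrated_model = CalibratedClassifierCV(grid.best_estimator_, method="sigmoid", cv=5)
    calibrated_model.fit(X_train, y_train)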

Overall, the issue seems to be that the model drastically overestimates probabilities for certain cases.

What metrics should I use to diagnose the root of the issue, and what are some ways I can address it?

EDIT: Following the comments, I've updated my code to optimize log loss instead of accuracy. These are the metrics I now get:

    import matplotlib.pyplot as plt
    from sklearn.calibration import calibration_curve

    # Reliability diagram: bin the predictions and compare to observed frequencies
    fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_pred_proba_cv, n_bins=10)
    plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Calibration curve")
    plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    plt.xlabel("Mean Predicted Value")
    plt.ylabel("Fraction of Positives")
    plt.legend(loc="lower right")
    plt.title("Calibration Curve")
    plt.show()

[Calibration curve plot]
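
For reference, `y_pred_proba_cv` holds the positive-class probabilities; a minimal sketch of how it might have been produced, assuming out-of-fold predictions from the calibrated model:

    # Assumed origin of y_pred_proba_cv: out-of-fold positive-class probabilities
    from sklearn.model_selection import cross_val_predict

    y_pred_proba_cv = cross_val_predict(
        calibrated_model, X_test, y_test, cv=5, method="predict_proba"
    )[:, 1]  # column 1 = probability of the positive class

The log-loss cross-validation was then run as follows: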

    from sklearn.metrics import log_loss, make_scorer
    from sklearn.model_selection import RepeatedKFold, cross_val_score

    cross_val = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
    # Negated because scikit-learn maximizes scores, while log loss is minimized
    log_loss_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)

    log_loss_scores = cross_val_score(calibrated_model, X_test, y_test,
                                      scoring=log_loss_scorer, cv=cross_val, n_jobs=-1)

    mean_log_loss = -log_loss_scores.mean()  # flip the sign back to a positive loss
    std_log_loss = log_loss_scores.std()

    print('Cross-Validation Log Loss: %.3f (%.3f)' % (mean_log_loss, std_log_loss))

Cross-Validation Log Loss: 0.502 (0.016)
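
As also suggested in the comments below, the Brier score is another proper scoring rule worth checking (a minimal sketch, reusing the variables above):

    from sklearn.metrics import brier_score_loss

    # Brier score: mean squared error between predicted probabilities and outcomes
    print('Brier score: %.3f' % brier_score_loss(y_test, y_pred_proba_cv))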

  • You need to monitor log loss (training/test), so set the cross-validation metric to log loss (not accuracy). Accuracy does not measure probability accuracy. – seanv507 Jun 18 '23 at 08:43
  • This may be due to an inappropriate error measure. If you try to optimize accuracy, precision, recall, F1 etc., you incentivize miscalibrated and extreme outputs of 0 or 1. See Why is accuracy not the best measure for assessing classification models? Instead, use a proper scoring rule, like the log or the Brier score. More information can be found here. – Stephan Kolassa Jun 18 '23 at 10:46
  • I've updated my code to optimize log loss, and it seems to give better results. I've also added some metrics in my edit; my calibration curve wobbles around the perfect-calibration line. – user54565 Jun 18 '23 at 23:46
  • After changing the code to optimize log loss, and reviewing your edit and comments, it seems that this has addressed your concern. Do you have any remaining questions? – Sycorax Jun 18 '23 at 23:55
  • +1 to Sycorax's point and to your own work. That calibration plot looks quite good: monotonic, following the diagonal reasonably well. I see no issue there. – usεr11852 Jun 19 '23 at 00:30
  • Thanks! One last question, for reference: what levels of log loss are considered adequate for binary classification? And what other metrics can I use, besides log loss, calibration, and the ROC curve, to evaluate a binary classifier? – user54565 Jun 19 '23 at 03:12
  • @user54565 Those really warrant their own posted questions, though they should be covered elsewhere on here. You might consider checking some of the links in my profile, especially the one about model evaluation. – Dave Jun 19 '23 at 03:43

1 Answer


You already addressed the problem you were facing by changing the loss function, but as you can learn from Are XGBoost probabilities well-calibrated?, the probabilities returned by XGBoost are in general not well calibrated, so overestimates are to be expected. For further improvement, you would need to calibrate the probabilities or pick a model that is well-calibrated out of the box (e.g. logistic regression).
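
A rough sketch of what such post-hoc calibration could look like (a minimal example; the `X_train`/`y_train`/`X_test` names are assumed):

    # Sketch: post-hoc calibration of an XGBoost classifier (names assumed)
    from xgboost import XGBClassifier
    from sklearn.calibration import CalibratedClassifierCV

    # "isotonic" is more flexible than "sigmoid" (Platt scaling) but needs more data
    calibrated = CalibratedClassifierCV(XGBClassifier(), method="isotonic", cv=5)
    calibrated.fit(X_train, y_train)
    proba = calibrated.predict_proba(X_test)[:, 1]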

Tim