
I have a model that uses XGBoost for binary classification. However, on certain datasets I am getting probability outputs that are quite unrealistic: probabilities close to 100% that, from domain knowledge, I know for a fact cannot be right. Is this often a sign of overfitting? I've plotted training and testing accuracy and they do not diverge much, which leads me to believe overfitting is not the most likely issue. In addition, my confusion-matrix-based metrics look adequate (accuracy: 73%, F1: 75%, recall: 75%).

I've used GridSearchCV to optimize hyperparameters, and have calibrated the model using CalibratedClassifierCV with logistic regression.
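
For reference, my setup looks roughly like this (a minimal sketch; the parameter grid and the `X_train`/`y_train` names here are illustrative, not my exact code):

    # Rough sketch of the setup (parameter grid and variable names are illustrative)
    from xgboost import XGBClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.calibration import CalibratedClassifierCV

    param_grid = {"max_depth": [3, 5, 7], "learning_rate": [0.05, 0.1]}
    grid = GridSearchCV(XGBClassifier(), param_grid, cv=5)
    grid.fit(X_train, y_train)

    # "sigmoid" fits a logistic regression to the raw scores (Platt scaling)
    calibrated_model = CalibratedClassifierCV(grid.best_estimator_, method="sigmoid", cv=5)
    calibrated_model.fit(X_train, y_train)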

Overall, the issue seems to be that the model drastically overestimates probabilities for certain cases.

What metrics should I use to diagnose the root of the issue, and what are some ways I can address it?

EDIT: Following the comments, I've updated my code to optimize log loss instead of accuracy. These are the metrics I now get:

    import matplotlib.pyplot as plt
    from sklearn.calibration import calibration_curve

    # Reliability diagram: bin the predictions and compare to observed frequencies
    fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_pred_proba_cv, n_bins=10)
    plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Calibration curve")
    plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
    plt.xlabel("Mean Predicted Value")
    plt.ylabel("Fraction of Positives")
    plt.legend(loc="lower right")
    plt.title("Calibration Curve")
    plt.show()

[Calibration curve plot]
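
For reference, `y_pred_proba_cv` holds the positive-class probabilities; a minimal sketch of how it might have been produced, assuming out-of-fold predictions from the calibrated model:

    # Assumed origin of y_pred_proba_cv: out-of-fold positive-class probabilities
    from sklearn.model_selection import cross_val_predict

    y_pred_proba_cv = cross_val_predict(
        calibrated_model, X_test, y_test, cv=5, method="predict_proba"
    )[:, 1]  # column 1 = probability of the positive class

The log-loss cross-validation was then run as follows: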

    from sklearn.metrics import log_loss, make_scorer
    from sklearn.model_selection import RepeatedKFold, cross_val_score

    cross_val = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
    # Negated because scikit-learn maximizes scores, while log loss is minimized
    log_loss_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)

    log_loss_scores = cross_val_score(calibrated_model, X_test, y_test,
                                      scoring=log_loss_scorer, cv=cross_val, n_jobs=-1)

    mean_log_loss = -log_loss_scores.mean()  # flip the sign back to a positive loss
    std_log_loss = log_loss_scores.std()

    print('Cross-Validation Log Loss: %.3f (%.3f)' % (mean_log_loss, std_log_loss))

Cross-Validation Log Loss: 0.502 (0.016)
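
As also suggested in the comments below, the Brier score is another proper scoring rule worth checking (a minimal sketch, reusing the variables above):

    from sklearn.metrics import brier_score_loss

    # Brier score: mean squared error between predicted probabilities and outcomes
    print('Brier score: %.3f' % brier_score_loss(y_test, y_pred_proba_cv))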

  • You need to monitor log loss (training/test), so set the cross-validation metric to log loss (not accuracy). Accuracy does not measure probability accuracy. – seanv507 Jun 18 '23 at 08:43
  • This may be due to an inappropriate error measure. If you try to optimize accuracy, precision, recall, F1 etc., you incentivize miscalibrated and extreme outputs of 0 or 1. See Why is accuracy not the best measure for assessing classification models? Instead, use a proper scoring rule, like the log or the Brier score. More information can be found here. – Stephan Kolassa Jun 18 '23 at 10:46
  • I've updated my code to optimize log loss, and it seems to give better results. I've also added some metrics in my edit; my calibration curve wobbles around the perfect-calibration line. – user54565 Jun 18 '23 at 23:46
  • After changing the code to optimize log loss, and reviewing your edit and comments, it seems that this has addressed your concern. Do you have any remaining questions? – Sycorax Jun 18 '23 at 23:55
  • +1 to Sycorax's point and to your own work. That calibration plot looks quite good: monotonic, following the diagonal reasonably well. I see no issue there. – usεr11852 Jun 19 '23 at 00:30
  • Thanks! One last question, for reference: what levels of log loss are considered adequate for binary classification? And what other metrics can I use, besides log loss, calibration, and the ROC curve, to evaluate a binary classifier? – user54565 Jun 19 '23 at 03:12
  • @user54565 Those really warrant their own posted questions, though they should be covered elsewhere on here. You might consider checking some of the links in my profile, especially the one about model evaluation. – Dave Jun 19 '23 at 03:43

1 Answer


You already addressed the problem you were facing by changing the loss function, but as you can learn from Are XGBoost probabilities well-calibrated?, the probabilities returned by XGBoost are in general not well calibrated, so overestimates are to be expected. For further improvement, you would need to calibrate the probabilities or pick a model that is well-calibrated out of the box (e.g. logistic regression).
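
A rough sketch of what such post-hoc calibration could look like (a minimal example; the `X_train`/`y_train`/`X_test` names are assumed):

    # Sketch: post-hoc calibration of an XGBoost classifier (names assumed)
    from xgboost import XGBClassifier
    from sklearn.calibration import CalibratedClassifierCV

    # "isotonic" is more flexible than "sigmoid" (Platt scaling) but needs more data
    calibrated = CalibratedClassifierCV(XGBClassifier(), method="isotonic", cv=5)
    calibrated.fit(X_train, y_train)
    proba = calibrated.predict_proba(X_test)[:, 1]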

Tim