I have an XGBoost model for a binary classification task. However, on certain datasets I am getting probability outputs that are quite unrealistic: probabilities close to 100% that, from domain knowledge, I know for a fact cannot be right. Is this often a sign of overfitting? I've compared training and testing accuracy via plots and they don't diverge much, which leads me to believe the issue is most likely not overfitting. In addition, my confusion matrix yields adequate results (accuracy: 73%, F1: 75%, recall: 75%).
I've used GridSearchCV to optimize hyperparameters, and I've calibrated the model with CalibratedClassifierCV using logistic regression (sigmoid/Platt scaling).
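For context, the calibration setup looks roughly like this (a minimal sketch; the hyperparameter values and variable names are placeholders, not my exact configuration):

from sklearn.calibration import CalibratedClassifierCV
from xgboost import XGBClassifier

# Sketch: method="sigmoid" fits a logistic regression (Platt scaling)
# on the booster's outputs via internal cross-validation
xgb_model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
calibrated_model = CalibratedClassifierCV(xgb_model, method="sigmoid", cv=5)
calibrated_model.fit(X_train, y_train)
y_pred_proba_cv = calibrated_model.predict_proba(X_test)[:, 1]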
Overall, the issue seems to be that the model overestimates probabilities by an extreme amount for certain cases.
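As a quick sanity check on how widespread this is, I count the predictions pushed to the extremes (a sketch; the 1%/99% cutoffs are arbitrary):

import numpy as np

# Fraction of test predictions at the extreme ends of [0, 1]
proba = np.asarray(y_pred_proba_cv)
extreme = (proba > 0.99) | (proba < 0.01)
print(f"{extreme.mean():.1%} of test predictions are above 99% or below 1%")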
What metrics should I look at to determine the root of the issue, and what are some ways I can address it?
EDIT: Following the comments, I've updated my code to optimize log loss instead of accuracy.
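The grid-search change was essentially swapping the scoring argument (a sketch; the parameter grid shown is illustrative, not my actual grid):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Select hyperparameters by (negated) log loss instead of accuracy
param_grid = {'max_depth': [3, 4, 6], 'learning_rate': [0.05, 0.1], 'n_estimators': [100, 200]}
grid = GridSearchCV(XGBClassifier(), param_grid, scoring='neg_log_loss', cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
xgb_model = grid.best_estimator_

With the refit model calibrated as above, these are the metrics I get: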
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss, make_scorer
from sklearn.model_selection import RepeatedKFold, cross_val_score
import matplotlib.pyplot as plt

# Reliability diagram: mean predicted probability vs. observed fraction of positives
fraction_of_positives, mean_predicted_value = calibration_curve(y_test, y_pred_proba_cv, n_bins=10)
plt.plot(mean_predicted_value, fraction_of_positives, "s-", label="Calibration curve")
plt.plot([0, 1], [0, 1], "k:", label="Perfectly calibrated")
plt.xlabel("Mean Predicted Value")
plt.ylabel("Fraction of Positives")
plt.legend(loc="lower right")
plt.title("Calibration Curve")
plt.show()

# Cross-validated log loss of the calibrated model
cross_val = RepeatedKFold(n_splits=5, n_repeats=3, random_state=1)
# needs_proba is deprecated since scikit-learn 1.4; newer versions use response_method="predict_proba"
log_loss_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True)
log_loss_scores = cross_val_score(calibrated_model, X_test, y_test, scoring=log_loss_scorer, cv=cross_val, n_jobs=-1)
mean_log_loss = -log_loss_scores.mean()  # scorer negates log loss, so flip the sign back
std_log_loss = log_loss_scores.std()
print('Cross-Validation Log Loss: %.3f (%.3f)' % (mean_log_loss, std_log_loss))
Cross-Validation Log Loss: 0.502 (0.016)
