I am currently working on a machine learning problem with the following characteristics:
- Data have binary outcomes and are severely imbalanced (the positive class is ~0.5% of my sample of ~500,000 data points).
- I'm using an XGBoost model to estimate probabilities of positive outcomes.
- It's important that the probabilities predicted by the model are realistic, i.e. of the data points assigned a probability of X%, approximately X% should be positive (a minimal reliability check is sketched after this list).
- I want to use a model explanation framework (currently TreeSHAP) to identify risk factors that contribute to a data point having a relatively high probability of belonging to the positive class.
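To make the calibration requirement concrete, here is a minimal reliability check in base R; `p_hat` and `y_obs` are simulated stand-ins for held-out predictions and outcomes, not my actual data. Within each bin of predicted probability, the observed positive rate should be close to the mean predicted probability.

```r
# Stand-ins for held-out predicted probabilities (p_hat) and observed outcomes (y_obs);
# y_obs is generated so that p_hat is well calibrated by construction.
set.seed(1)
p_hat <- runif(1e5, 0, 0.05)
y_obs <- rbinom(1e5, 1, p_hat)

# Decile bins of predicted probability: mean prediction vs. observed positive rate
bins <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
data.frame(mean_predicted = tapply(p_hat, bins, mean),
           observed_rate  = tapply(y_obs, bins, mean))
```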
My question is, given the above, what is the optimal approach to 1) model training, 2) hyperparameter optimization, and 3) resampling/weighting (if any)?
Model training
As I understand it, logloss is the correct metric to optimize during training to obtain realistic probabilities. Further, I use stratified cross-validation to ensure a relatively stable ratio of positive to negative classes across the train/test folds. Is this the correct approach?
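For concreteness, here is a minimal sketch of that training setup, assuming the R xgboost package and simulated stand-in data; the hyperparameter values are placeholders, not tuned choices.

```r
library(xgboost)

# Simulated stand-in for the real data: roughly 0.5% positives
set.seed(1)
n <- 50000
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- rbinom(n, 1, plogis(-5.5 + X[, 1]))

dtrain <- xgb.DMatrix(X, label = y)

# Stratified CV optimizing logloss, with early stopping
cv <- xgb.cv(
  params = list(objective = "binary:logistic",
                eval_metric = "logloss",
                eta = 0.1, max_depth = 4),
  data = dtrain,
  nrounds = 500,
  nfold = 5,
  stratified = TRUE,          # keeps the class ratio stable across folds
  prediction = TRUE,          # keep out-of-fold predictions for later checks
  early_stopping_rounds = 20,
  verbose = 0
)
cv$best_iteration
```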
Hyperparameter optimization
I want to choose hyperparameters such that they optimize the right metric. The choice of metric is the root cause of many class imbalance problems (see e.g. [this question][1] and its answers), so I want to choose the right one. This [post on Machine Learning Mastery][2] (scroll to section 3.1.3) claims that one should use either the Brier score or PR AUC. Is the Brier score or PR AUC (or some other option) the better choice in my case, and why?
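For reference, both candidates are cheap to compute from out-of-fold predictions. A base-R sketch, assuming `cv` and `y` from the CV sketch above (PR AUC estimated as average precision):

```r
# Out-of-fold predicted probabilities from the CV sketch above
p <- cv$pred

# Brier score: mean squared difference between probability and outcome
brier <- mean((y - p)^2)

# PR AUC estimated as average precision, base R only
ord    <- order(p, decreasing = TRUE)
y_s    <- y[ord]
prec   <- cumsum(y_s) / seq_along(y_s)   # precision at each rank
pr_auc <- sum(prec * y_s) / sum(y_s)     # average precision over the positives

c(brier = brier, pr_auc = pr_auc)
```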
Resampling/weighting
From reviewing related posts, the [consensus seems to be that class imbalance is a pseudo-problem][3] and that resampling should generally not be done.
Some answers recommend using XGBoost's `scale_pos_weight` parameter to assign greater weight to the minority (positive) class.
However, the XGBoost [documentation][4] advises against this if one wants correct probabilities (as is the case here).
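As a rough illustration of that warning (reusing `dtrain`, `y`, and `cv` from the sketches above; the hyperparameters are again placeholders): training the same model with and without `scale_pos_weight` and comparing the average predicted probability to the observed base rate shows how the weighted model's probabilities are inflated.

```r
# Same model with and without scale_pos_weight (reuses dtrain, y, cv from above)
params <- list(objective = "binary:logistic", eval_metric = "logloss",
               eta = 0.1, max_depth = 4)
m_plain <- xgb.train(params, dtrain, nrounds = cv$best_iteration)
m_wtd   <- xgb.train(c(params, list(scale_pos_weight = sum(y == 0) / sum(y == 1))),
                     dtrain, nrounds = cv$best_iteration)

# The weighted model's average predicted probability sits far above the base rate
c(base_rate    = mean(y),
  mean_p_plain = mean(predict(m_plain, dtrain)),
  mean_p_wtd   = mean(predict(m_wtd,   dtrain)))
```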
Am I correct in not resampling or reweighting the positive class in my case? Will forgoing resampling/weighting reduce the ability of my SHAP scores to identify risk factors?
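For context on what I mean by SHAP-based risk factors: one way to obtain TreeSHAP attributions without extra packages is the `predcontrib = TRUE` option of the R xgboost `predict` method. A minimal sketch, reusing `m_plain` and `dtrain` from above, with a simple mean-|SHAP| ranking standing in for a proper analysis:

```r
# TreeSHAP-style attributions from xgboost itself; values are on the log-odds
# (margin) scale, one column per feature plus a final BIAS column.
contrib <- predict(m_plain, dtrain, predcontrib = TRUE)

# Crude global ranking of risk factors: mean absolute contribution per feature
sort(colMeans(abs(contrib[, -ncol(contrib)])), decreasing = TRUE)
```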
I apologize for lumping several questions into one, but the answer to each might depend on the others, so I couldn't see a good way to separate them.
  [1]: What is the root cause of the class imbalance problem?
  [2]: https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/
  [3]: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
  [4]: https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html#handle-imbalanced-dataset
Comments

model 2 gives $P(y=1 \mid x<0) = 0.1$, $P(y=1 \mid x>0) = 0.9$ - the probability is wrong, but the model predicts $y=1$ and $y=0$ with 100% precision. Why would this happen? In my example, it probably won't. But in many cases, you can get a good predictive model that gets the probabilities wrong. I would prefer model 2 over model 1. – leviva Aug 08 '22 at 21:40

```r
y <- rep(c(0, 1), c(9950, 50))
model1 <- rep(mean(y), length(y))
model2 <- rep(c(0.1, 0.9), c(9950, 50))
brier1 <- mean((y - model1)^2)
brier2 <- mean((y - model2)^2)
logloss1 <- -mean(y * log(model1) + (1 - y) * log(1 - model1))
logloss2 <- -mean(y * log(model2) + (1 - y) * log(1 - model2))
```
– Dave Aug 08 '22 at 21:54