I am currently working on a machine learning problem with the following characteristics:
- Data have binary outcomes and are severely imbalanced (the positive class is ~0.5% of my sample of ~500,000 data points).
- I'm using an XGBoost model to estimate probabilities of positive outcomes.
- It's important that the probabilities predicted by the model are realistic, i.e. of the data points assigned a probability of X%, approximately X% should be positive (a minimal reliability check is sketched after this list).
- I want to use a model explanation framework (currently TreeSHAP) to identify risk factors that contribute to a data point having a relatively high probability of belonging to the positive class.
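To make the calibration requirement concrete, here is a minimal reliability check in base R; `p_hat` and `y_obs` are simulated stand-ins for held-out predictions and outcomes, not my actual data. Within each bin of predicted probability, the observed positive rate should be close to the mean predicted probability.

```r
# Stand-ins for held-out predicted probabilities (p_hat) and observed outcomes (y_obs);
# y_obs is generated so that p_hat is well calibrated by construction.
set.seed(1)
p_hat <- runif(1e5, 0, 0.05)
y_obs <- rbinom(1e5, 1, p_hat)

# Decile bins of predicted probability: mean prediction vs. observed positive rate
bins <- cut(p_hat, breaks = quantile(p_hat, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
data.frame(mean_predicted = tapply(p_hat, bins, mean),
           observed_rate  = tapply(y_obs, bins, mean))
```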
My question is, given the above, what is the optimal approach to 1) model training, 2) hyperparameter optimization, and 3) resampling/weighting (if any)?
Model training
As I understand it, logloss is the correct metric to optimize during training to obtain realistic probabilities. Further, I use stratified cross-validation to ensure a relatively stable ratio of positive to negative classes across the train/test folds. Is this the correct approach?
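For concreteness, here is a minimal sketch of that training setup, assuming the R xgboost package and simulated stand-in data; the hyperparameter values are placeholders, not tuned choices.

```r
library(xgboost)

# Simulated stand-in for the real data: roughly 0.5% positives
set.seed(1)
n <- 50000
X <- matrix(rnorm(n * 10), ncol = 10)
colnames(X) <- paste0("x", 1:10)
y <- rbinom(n, 1, plogis(-5.5 + X[, 1]))

dtrain <- xgb.DMatrix(X, label = y)

# Stratified CV optimizing logloss, with early stopping
cv <- xgb.cv(
  params = list(objective = "binary:logistic",
                eval_metric = "logloss",
                eta = 0.1, max_depth = 4),
  data = dtrain,
  nrounds = 500,
  nfold = 5,
  stratified = TRUE,          # keeps the class ratio stable across folds
  prediction = TRUE,          # keep out-of-fold predictions for later checks
  early_stopping_rounds = 20,
  verbose = 0
)
cv$best_iteration
```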
Hyperparameter optimization
I want to choose hyperparameters such that they optimize the right metric. The choice of metric is the root cause of many class imbalance problems (see e.g. [this question][1] and its answers), so I want to choose the right one. This [post on Machine Learning Mastery][2] (scroll to section 3.1.3) claims that one should use either the Brier score or PR AUC. Is the Brier score or PR AUC (or some other option) the better choice in my case, and why?
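For reference, both candidates are cheap to compute from out-of-fold predictions. A base-R sketch, assuming `cv` and `y` from the CV sketch above (PR AUC estimated as average precision):

```r
# Out-of-fold predicted probabilities from the CV sketch above
p <- cv$pred

# Brier score: mean squared difference between probability and outcome
brier <- mean((y - p)^2)

# PR AUC estimated as average precision, base R only
ord    <- order(p, decreasing = TRUE)
y_s    <- y[ord]
prec   <- cumsum(y_s) / seq_along(y_s)   # precision at each rank
pr_auc <- sum(prec * y_s) / sum(y_s)     # average precision over the positives

c(brier = brier, pr_auc = pr_auc)
```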
Resampling/weighting
From reviewing related posts, the [consensus seems to be that class imbalance is a pseudo-problem][3] and that resampling should generally not be done.
Some answers recommend using XGBoost's `scale_pos_weight` parameter to assign greater weight to the minority (positive) class.
However, the XGBoost [documentation][4] advises against this if one wants correct probabilities (as is the case here).
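As a rough illustration of that warning (reusing `dtrain`, `y`, and `cv` from the sketches above; the hyperparameters are again placeholders): training the same model with and without `scale_pos_weight` and comparing the average predicted probability to the observed base rate shows how the weighted model's probabilities are inflated.

```r
# Same model with and without scale_pos_weight (reuses dtrain, y, cv from above)
params <- list(objective = "binary:logistic", eval_metric = "logloss",
               eta = 0.1, max_depth = 4)
m_plain <- xgb.train(params, dtrain, nrounds = cv$best_iteration)
m_wtd   <- xgb.train(c(params, list(scale_pos_weight = sum(y == 0) / sum(y == 1))),
                     dtrain, nrounds = cv$best_iteration)

# The weighted model's average predicted probability sits far above the base rate
c(base_rate    = mean(y),
  mean_p_plain = mean(predict(m_plain, dtrain)),
  mean_p_wtd   = mean(predict(m_wtd,   dtrain)))
```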
Am I correct in not resampling or reweighting the positive class in my case? Will forgoing resampling/weighting reduce the ability of my SHAP scores to identify risk factors?
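For context on what I mean by SHAP-based risk factors: one way to obtain TreeSHAP attributions without extra packages is the `predcontrib = TRUE` option of the R xgboost `predict` method. A minimal sketch, reusing `m_plain` and `dtrain` from above, with a simple mean-|SHAP| ranking standing in for a proper analysis:

```r
# TreeSHAP-style attributions from xgboost itself; values are on the log-odds
# (margin) scale, one column per feature plus a final BIAS column.
contrib <- predict(m_plain, dtrain, predcontrib = TRUE)

# Crude global ranking of risk factors: mean absolute contribution per feature
sort(colMeans(abs(contrib[, -ncol(contrib)])), decreasing = TRUE)
```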
I apologize for lumping several questions into one, but the answer to each might depend on the others, so I couldn't see a good way to separate them.
  [1]: What is the root cause of the class imbalance problem?
  [2]: https://machinelearningmastery.com/framework-for-imbalanced-classification-projects/
  [3]: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
  [4]: https://xgboost.readthedocs.io/en/latest/tutorials/param_tuning.html#handle-imbalanced-dataset
Comments

model 2 gives $P(y=1 \mid x<0) = 0.1$, $P(y=1 \mid x>0) = 0.9$ - the probability is wrong, but the model predicts $y=1$ and $y=0$ with 100% precision. Why would this happen? In my example, it probably won't. But in many cases, you can get a good predictive model that gets the probabilities wrong. I would prefer model 2 over model 1. – leviva Aug 08 '22 at 21:40

```r
y <- rep(c(0, 1), c(9950, 50))
model1 <- rep(mean(y), length(y))
model2 <- rep(c(0.1, 0.9), c(9950, 50))
brier1 <- mean((y - model1)^2)
brier2 <- mean((y - model2)^2)
logloss1 <- -mean(y * log(model1) + (1 - y) * log(1 - model1))
logloss2 <- -mean(y * log(model2) + (1 - y) * log(1 - model2))
```
– Dave Aug 08 '22 at 21:54