Best practice for subsampling training data and weights (in XGBoost)

Question

I am trying to build an XGBoost model in PyCharm, and I have a general methodological question, even though it relates to my model of choice (XGBoost). General comments on the proper statistical method are appreciated too. My data is very unbalanced: 12,114 ones and 1,732,081 zeros (which I then split, using 60% for training). I am not clear, however, on how to account for the imbalance in the two sets.

My solution so far: I have used the class weights and I think that is fine (judging from the ROC and the calibration plot). That is:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight
from xgboost.spark import SparkXGBClassifier

# 'balanced' gives each class the weight n_samples / (n_classes * class_count)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=np.ravel(y_train))
hyper_param_dict = {
    "class_weight": {0: class_weights[0], 1: class_weights[1]},
    # other model parameters not relevant for the question:
    "num_workers": 4, "max_depth": 5, "n_estimators": 300, "learning_rate": 0.1, "gamma": 1,
}
xgb_classifier = SparkXGBClassifier(label_col='label', features_col='features', **hyper_param_dict)

My problem: Training the model like this takes 8 hours, which led me to consider subsampling. Here I am a bit confused about what the proper way to do so is.

One option would be to subsample in the same ratio, but then I am afraid that I will have too few 1's and the model will not learn them well.
The other option (which I thought would be better, if it can be implemented) was to build equally-sized sets of ones and zeros with 7,268 elements each. Speed-wise, this only takes half an hour, but I don't think I implemented the weights correctly and I am afraid of overfitting the ones. My thinking is that I have to give the model both the ratio of 0's to 1's in the full training data and the ratio in the subsample, but I don't know where. I saw for sklearn (here) that there is an option 'subsample' where I can input the subsample ratio, but I have no idea whether that is useful and/or how to use it.
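A minimal sketch of the second option, under stated assumptions: keep every 1, keep a random fraction of the 0's, and give each kept 0 a weight of 1 / (fraction kept) so that the weighted class totals still approximate the full training set. The function name `downsample_negatives` and the toy data are hypothetical. If the 60% split preserves the class ratio, an equal-sized set of 7,268 zeros corresponds to keeping roughly 0.7% of the zeros, i.e. a weight of about 143 per kept zero. The weights would then have to reach the model as per-row weights (for `SparkXGBClassifier`, presumably through a weight column such as `weight_col`; that part is an assumption and is not shown).

```python
import random


def downsample_negatives(labels, neg_keep_fraction, seed=0):
    """Keep every positive and a random fraction of the negatives.

    Returns (kept_indices, weights). Each kept negative gets weight
    1 / neg_keep_fraction so the weighted class totals approximate
    the full training set; positives keep weight 1.0.
    """
    if not 0.0 < neg_keep_fraction <= 1.0:
        raise ValueError("neg_keep_fraction must be in (0, 1]")
    rng = random.Random(seed)
    kept_indices, weights = [], []
    for i, y in enumerate(labels):
        if y == 1:
            kept_indices.append(i)
            weights.append(1.0)
        elif rng.random() < neg_keep_fraction:
            kept_indices.append(i)
            weights.append(1.0 / neg_keep_fraction)
    return kept_indices, weights


# Toy data (hypothetical): about 1 positive per 143 negatives.
labels = [1] * 70 + [0] * 10_000
idx, w = downsample_negatives(labels, neg_keep_fraction=0.05, seed=42)
n_pos = sum(1 for i in idx if labels[i] == 1)
weighted_neg = sum(wi for i, wi in zip(idx, w) if labels[i] == 0)
print(n_pos, len(idx) - n_pos, round(weighted_neg))
```

Varying `neg_keep_fraction` between the two extremes (same ratio as the full data versus fully balanced) is one way to look for the smallest training set that still performs acceptably.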

EDIT to make this question more precise and not sound as if I wonder whether unbalanced sets are a problem: I care about TPR, I want to predict my 1's fairly well, although this is not my sole purpose (and I care very much about precision too). I want to make my model as small as possible, as it currently takes a very long time to train. So, I am asking which would be the better way to subsample and what can I use to evaluate and/or decide on an optimal trade-off and minimal training data size.
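Since both TPR and precision matter here, one possible way to compare subsampling schemes is to score every candidate model on the same held-out set that keeps the original class ratio, and to look at precision and recall at a few thresholds. ROC curves depend only on TPR and FPR, neither of which changes with the class ratio, whereas precision does. The sketch below is illustrative only; the function name `precision_recall_at` and the toy scores are hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall (TPR) when predicting 1 for scores >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 1:
            fn += 1
    # Guard against empty denominators (no predicted or no actual positives).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy held-out labels and model scores (hypothetical values).
y_val = [1, 0, 0, 1, 0, 0, 0, 1]
scores_val = [0.9, 0.8, 0.1, 0.7, 0.2, 0.4, 0.05, 0.3]
for t in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(y_val, scores_val, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Plotting these numbers against training-set size for each subsampling scheme would show where precision or recall starts to degrade.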

Thanks in advance for the help :)

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help? — Dave, Feb 12 '24 at 16:03
Hi @Dave, thanks for the reference. I understand that having fewer observations will generally hurt my model, but because of the really long training times I don't think it's feasible to keep re-training the model on the full data. That is why I asked the question: what should I do if I need to make my training data smaller, AND how do I implement that in my model with the options that are already built in? I have been using the Brier score loss and ROC curves, but is there another (better) way to assess the trade-off, especially as the ROC curves don't seem to change? — Magi, Feb 13 '24 at 08:27
"I have used the class weights and I think that is fine (judging from [...] the calibration plot)." This is surprising; class weights should skew the predicted probabilities significantly. — Ben Reiniger, Feb 13 '24 at 13:44
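To illustrate the skew mentioned in the comment above: if the negatives are kept with probability `neg_keep_fraction` (or, roughly equivalently, the positives are upweighted by its inverse), the odds the model learns are inflated by a factor of 1 / `neg_keep_fraction`. Assuming the model is well calibrated on the resampled data, the standard odds rescaling below maps its output back to the original class ratio. This is a sketch, and the function name is hypothetical.

```python
def correct_for_negative_downsampling(p_sub, neg_keep_fraction):
    """Rescale a probability learned on downsampled negatives.

    Downsampling negatives by neg_keep_fraction multiplies the odds of
    class 1 by 1 / neg_keep_fraction; this undoes that multiplication.
    """
    if not 0.0 < neg_keep_fraction <= 1.0:
        raise ValueError("neg_keep_fraction must be in (0, 1]")
    if not 0.0 <= p_sub <= 1.0:
        raise ValueError("p_sub must be a probability in [0, 1]")
    numerator = neg_keep_fraction * p_sub
    return numerator / (numerator + 1.0 - p_sub)


# A score of 0.5 on a balanced subsample (about 0.7% of negatives kept)
# corresponds to a much smaller probability at the original class ratio.
print(correct_for_negative_downsampling(0.5, 0.007))
```

If the calibration plot looks fine without any such correction, that may be a sign that the weights are not actually reaching the model.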
