
I have a project for predicting credit card approvals (binary classification).

I got stuck on the feature selection, hyperparameter tuning, and final testing stages, because I don't know how to properly modify (balance) the data to avoid data leakage or overfitting, since my datasets are imbalanced.


The sets look as follows:

  1. training_set (highly imbalanced, ~112,000 samples)

  2. testing_set (highly imbalanced, ~55,000 samples)

  3. validation_set (highly imbalanced, ~55,000 samples)

All sets have been preprocessed independently.

As the model I decided to use XGBoost, with the F1 score as the evaluation metric.


What I want to do:

  1. Perform hyperparameter tuning on the XGBoost model.
  2. Select features using RFE.
  3. Validate the resulting model on the testing set using cross-validation.
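
Roughly, the leakage-free setup I have in mind looks like the sketch below (assuming scikit-learn's Pipeline, RFE and RandomizedSearchCV around XGBClassifier; the parameter ranges are just placeholders):

    # Sketch of the intended setup: feature selection and tuning happen inside
    # the cross-validation folds on the training set only, so the testing and
    # validation sets are never touched until the final evaluation.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
    from xgboost import XGBClassifier

    pipe = Pipeline([
        ("rfe", RFE(estimator=XGBClassifier())),  # RFE ranks features by XGBoost's importances
        ("clf", XGBClassifier()),
    ])

    # Placeholder search space, not a recommendation.
    param_distributions = {
        "rfe__n_features_to_select": [10, 20, 30],
        "clf__max_depth": [3, 5, 7],
        "clf__learning_rate": [0.01, 0.05, 0.1],
        "clf__n_estimators": [200, 400, 800],
    }

    search = RandomizedSearchCV(
        pipe,
        param_distributions,
        n_iter=25,
        scoring="f1",  # my current metric
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
        n_jobs=-1,
    )
    # search.fit(X_train, y_train)  # fit on the training set only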

I've tried some manipulations on my own, but ended up with either a poorly performing model or data leakage.

How would you handle this type of problem? Which sets should I balance and which not?

  • Artificially balancing the data doesn't solve a real problem. https://stats.stackexchange.com/questions/285231/what-problem-does-oversampling-undersampling-and-smote-solve?noredirect=1&lq=1 – Sycorax Jul 17 '23 at 13:22
  • Unless identifying relevant features is a specific goal of the analysis, you are probably better off not performing feature selection, but just using regularisation to avoid overfitting. It is very easy to overfit the feature selection criterion. See https://stats.stackexchange.com/questions/27750/feature-selection-and-cross-validation/27751#27751 – Dikran Marsupial Jul 17 '23 at 13:24
  • And even if identifying the important features is a major goal, it is not a given that this can be done reliably! // Still, why balance the data at all? – Dave Jul 17 '23 at 13:27
  • Is this a real project or just a student project? If it's a real project then log loss would be a better criterion, along with looking at the probability estimate output (e.g. for calibration). Accurate probabilities are necessary to calculate expected loss and other risk metrics in order to identify correct pricing. – seanv507 Jul 17 '23 at 14:22

1 Answer


Don't balance the dataset. As you are interested in "predicting credit card approvals" you have a task where different types of error have different costs. Giving someone a credit card when you shouldn't (because they have a high risk of defaulting on their debt) is likely to have a higher cost than not giving someone a credit card when you should (as the profit per transaction is likely to be fairly low). Work out plausible values for those misclassification costs and build them into your model selection and performance evaluation criteria. If you do that, normal modelling procedures ought to work fine.
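
For example (a rough sketch, not a recipe: the cost figures below are made up, and realistic values have to come from the business side of the problem), with XGBoost you can weight the training examples by their misclassification costs:

    # Illustration only: the cost figures are placeholders.
    import numpy as np
    from xgboost import XGBClassifier

    cost_false_approval = 10.0   # approving an applicant who then defaults (placeholder)
    cost_false_rejection = 1.0   # rejecting an applicant who would have been fine (placeholder)

    # Assuming y_train == 1 marks the "bad risk" class, weight each training
    # example by the cost of misclassifying it.
    def cost_weights(y):
        return np.where(y == 1, cost_false_approval, cost_false_rejection)

    model = XGBClassifier()
    # model.fit(X_train, y_train, sample_weight=cost_weights(y_train))

    # A coarser alternative is XGBoost's scale_pos_weight parameter, which
    # up-weights the positive class as a whole by a single factor, e.g.
    # XGBClassifier(scale_pos_weight=cost_false_approval / cost_false_rejection).
    # Use one of the two approaches, not both at once.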

In short, you need to work out what is really important in practice, in the real-world application, and use that to determine what you need to do. When you evaluate the model, make sure your criterion is measuring what is really important - in this case it isn't accuracy, it is how much profit the credit provider is making.
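
Similarly (again just a sketch, with placeholder costs), the evaluation and model selection criterion can be an expected misclassification cost wrapped as a scorer, used in place of accuracy or F1 when tuning:

    # Sketch: a cost-based selection criterion (cost numbers are placeholders).
    from sklearn.metrics import confusion_matrix, make_scorer

    def total_misclassification_cost(y_true, y_pred, cost_fp=1.0, cost_fn=10.0):
        """Total cost of the errors made on this set of predictions (lower is better)."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return fp * cost_fp + fn * cost_fn

    # greater_is_better=False makes cross-validation / grid search minimise the cost.
    cost_scorer = make_scorer(total_misclassification_cost, greater_is_better=False)

    # e.g. RandomizedSearchCV(pipe, param_distributions, scoring=cost_scorer, ...)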

Dikran Marsupial
  • Something interesting about this is that, if the data concern whether or not someone was approved (this seems to be the case), then the predictive model will only predict if someone will be approved, not if they should be approved. – Dave Jul 17 '23 at 13:32
  • ... plus the problem of likely having biased training data - if an applicant looked iffy in whatever system was used before to decide on approvals, they likely did not make it into the training data in the first place. – Stephan Kolassa Jul 17 '23 at 13:35
  • @Dave and Stephan, excellent points! Thinking about what we are really trying to achieve is vital; worrying about technical issues is irrelevant if we aren't answering the right question! – Dikran Marsupial Jul 17 '23 at 13:36
  • Thanks, y'all. I still have a question: if I don't balance any of the given sets, won't my model over-focus on the majority class and end up with the wrong hyperparameters, leading to poor predictions on the testing set? – CraZyCoDer Jul 17 '23 at 13:41
  • I want to perform feature selection in order to pass the important features via REST calls to my web application. Is there some other workaround for that? – CraZyCoDer Jul 17 '23 at 13:43
  • @CraZyCoDer if you build the misclassification costs into the training criterion for the model, and into the model selection criterion for tuning hyperparameters, it won't *over*-focus. The reason it over-focuses with default settings is that those defaults say that false positives and false negatives have the same costs. But most classifiers allow this to be changed, e.g. for the support vector machine, use different regularisation constants (C) for each class. – Dikran Marsupial Jul 17 '23 at 14:10
  • If you build the misclassification costs into the feature selection criterion, you should be O.K., but bear in mind that predictive performance will probably go down rather than up (due to overfitting in model/feature selection). – Dikran Marsupial Jul 17 '23 at 14:12
  • Note that even including the misclassification costs, it may be that the optimal solution is to assign all patterns to the majority class; that is just the correct answer to the question. See https://stats.stackexchange.com/questions/539638/how-do-you-know-that-your-classifier-is-suffering-from-class-imbalance – Dikran Marsupial Jul 17 '23 at 14:13