I have a project for predicting credit card approvals (binary classification).
I'm stuck at the feature selection, hyperparameter tuning, and final testing stages: my datasets are imbalanced, and I don't know how to properly modify (balance) the data without introducing data leakage or overfitting.
The sets look as follows:
- training_set (highly imbalanced, ~112,000 samples)
- testing_set (highly imbalanced, ~55,000 samples)
- validation_set (highly imbalanced, ~55,000 samples)
All sets have been preprocessed independently.
As the model I decided to use XGBoost, with F1 score as the evaluation metric.
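For reference, here is a minimal sketch of the setup I have in mind (the data is synthetic just to make the snippet runnable; using `scale_pos_weight` to re-weight the minority class instead of resampling is an assumption on my part, not something I've settled on):

```python
import numpy as np
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

# Synthetic stand-in for my preprocessed training set (~112k rows, heavily imbalanced)
X_train, y_train = make_classification(
    n_samples=112_000, n_features=30, weights=[0.97], random_state=42
)

# Re-weight the minority class via scale_pos_weight (negatives / positives)
# instead of resampling the data itself
neg, pos = np.bincount(y_train)
model = XGBClassifier(scale_pos_weight=neg / pos, eval_metric="logloss")
model.fit(X_train, y_train)
```

As far as I understand, re-weighting like this sidesteps the question of which sets to balance, since no set is ever modified, but I'm not sure it's the right call here.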
What I want to do:
- perform hyperparameter tuning on the XGBoost model,
- select features using RFE,
- validate the resulting model on the testing set using cross-validation (a rough sketch of the pipeline I have in mind follows this list).
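Concretely, this is roughly what I'm imagining (continuing from the snippet above for `X_train`/`y_train`/`neg`/`pos`; the parameter grid and feature counts are placeholders, not recommendations): RFE and the classifier are nested inside one Pipeline, so every cross-validation fold re-fits the feature selection on its own training split.

```python
from sklearn.feature_selection import RFE
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# RFE sits inside the pipeline, so each CV fold selects features
# on its own training split only -- nothing leaks into validation folds
pipe = Pipeline([
    ("rfe", RFE(XGBClassifier(scale_pos_weight=neg / pos))),
    ("clf", XGBClassifier(scale_pos_weight=neg / pos)),
])

# Placeholder search space; tunes the feature count and the model together
param_dist = {
    "rfe__n_features_to_select": [10, 15, 20],
    "clf__max_depth": [3, 5, 7],
    "clf__learning_rate": [0.01, 0.1, 0.3],
    "clf__n_estimators": [200, 400],
}

search = RandomizedSearchCV(
    pipe,
    param_dist,
    n_iter=20,
    scoring="f1",  # my chosen metric
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # training set only; the test set stays untouched
print(search.best_params_, search.best_score_)
```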
I've tried a few approaches on my own, but ended up with either a poorly performing model or data leakage.
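In case it helps diagnose things: I suspect my leakage came from resampling before cross-validating. If I were to oversample rather than re-weight, my understanding is that the resampler has to run inside each fold, e.g. via imblearn's Pipeline (SMOTE here is just an illustration, not something I'm committed to):

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# imblearn's Pipeline applies SMOTE only while fitting on each fold's
# training split; validation splits are scored on untouched data
pipe = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", XGBClassifier()),
])

scores = cross_val_score(
    pipe, X_train, y_train,  # X_train/y_train as above
    scoring="f1",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
print(scores.mean())
```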
How would you handle this type of problem? Which sets should I balance, and which should I leave as they are?