My dataset is high-dimensional (200 samples, 300 features) and imbalanced, and I am working on a binary classification problem. The class ratio is 80:20 in the training set and 88:12 in the held-out test set, which was collected at a different time point. Performance on the held-out test set is very low: low recall and precision for the minority class, with an AUC around 50-60%. I have tried several machine learning algorithms (e.g. logistic regression, random forest, XGBoost), but all models are prone to overfitting regardless of hyperparameter tuning. I am using SMOTETomek to address the class imbalance in the training set.
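For context, a minimal sketch of a leakage-safe version of this setup: resampling must happen *inside* each cross-validation fold, on the training rows only, otherwise synthetic copies of validation samples leak into the fit and CV scores look better than the held-out test score. The data here are synthetic stand-ins for the described 200×300 dataset, and a simple random oversampler stands in for SMOTETomek so the sketch needs only scikit-learn and NumPy (with imbalanced-learn installed, `imblearn.pipeline.Pipeline` achieves the same fold-wise behaviour).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import average_precision_score

# Synthetic stand-in for the described data: 200 samples, 300 features, ~80:20 imbalance.
X, y = make_classification(n_samples=200, n_features=300, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)

def random_oversample(X_tr, y_tr, rng):
    # Random oversampling of the minority class (class 1) up to parity;
    # a stand-in for SMOTETomek to keep the sketch dependency-free.
    minority = np.flatnonzero(y_tr == 1)
    n_extra = int(np.sum(y_tr == 0)) - minority.size
    extra = rng.choice(minority, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(y_tr.size), extra])
    return X_tr[idx], y_tr[idx]

rng = np.random.default_rng(0)
scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Resample INSIDE the fold, on the training rows only.
    X_res, y_res = random_oversample(X[tr], y[tr], rng)
    # Strong L2 regularization (small C) as a basic guard against
    # overfitting when features outnumber samples.
    model = LogisticRegression(C=0.1, max_iter=1000).fit(X_res, y_res)
    scores.append(average_precision_score(y[te], model.predict_proba(X[te])[:, 1]))

print(round(float(np.mean(scores)), 3))
```

Average precision (area under the PR curve) is used for scoring because, unlike accuracy, its chance level equals the minority prevalence, which makes it more informative on imbalanced data.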
- Does the difference in class ratios between the training and test sets affect performance on the held-out test set?
- How can I reduce overfitting and improve performance?
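On the first question, part of the precision drop is mechanical: even if a classifier's per-class error rates (TPR and FPR) are unchanged, its minority-class precision falls when minority prevalence falls, by Bayes' rule. A small worked example with a hypothetical operating point (60% recall, 20% false-positive rate; these numbers are illustrative, not from the question):

```python
def precision_at_prevalence(tpr, fpr, pi):
    # Precision = P(positive | predicted positive) for a classifier with
    # fixed TPR and FPR, as a function of minority prevalence pi:
    #   precision = TPR*pi / (TPR*pi + FPR*(1 - pi))
    return tpr * pi / (tpr * pi + fpr * (1 - pi))

# Same classifier, two prevalences: 20% (training) vs 12% (held-out test).
print(round(precision_at_prevalence(0.6, 0.2, 0.20), 3))  # -> 0.429
print(round(precision_at_prevalence(0.6, 0.2, 0.12), 3))  # -> 0.29
```

So a shift from 20% to 12% minority prevalence alone costs this classifier roughly 14 points of precision, before any distribution shift in the features is considered. Rank-based metrics such as ROC AUC are insensitive to prevalence, which is one reason to report them alongside precision/recall.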
SMOTETomek? – Dave May 11 '23 at 16:32

SMOTETomek. Final evaluation is done on the held-out test set using the best model from cross validation. – Dushi Fdz May 11 '23 at 17:18