My dataset is high-dimensional (200 samples, 300 features) and imbalanced, and I am working on a binary classification problem. The class ratio is 80:20 in the training set and 88:12 in the held-out test set, which was collected at a different time point. Performance on the held-out test set is very low: low recall and precision for the minority class, with an AUC around 50-60%. I have tried several machine learning algorithms (e.g. logistic regression, random forest, XGBoost), but all models are prone to overfitting regardless of hyperparameter tuning. I am using SMOTETomek to address the class imbalance in the training set.
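For context, a minimal sketch of a leakage-safe version of this setup: resampling must happen *inside* each cross-validation fold, on the training rows only, otherwise synthetic copies of validation samples leak into the fit and CV scores look better than the held-out test score. The data here are synthetic stand-ins for the described 200×300 dataset, and a simple random oversampler stands in for SMOTETomek so the sketch needs only scikit-learn and NumPy (with imbalanced-learn installed, `imblearn.pipeline.Pipeline` achieves the same fold-wise behaviour).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import average_precision_score

# Synthetic stand-in for the described data: 200 samples, 300 features, ~80:20 imbalance.
X, y = make_classification(n_samples=200, n_features=300, n_informative=10,
                           weights=[0.8, 0.2], random_state=0)

def random_oversample(X_tr, y_tr, rng):
    # Random oversampling of the minority class (class 1) up to parity;
    # a stand-in for SMOTETomek to keep the sketch dependency-free.
    minority = np.flatnonzero(y_tr == 1)
    n_extra = int(np.sum(y_tr == 0)) - minority.size
    extra = rng.choice(minority, size=n_extra, replace=True)
    idx = np.concatenate([np.arange(y_tr.size), extra])
    return X_tr[idx], y_tr[idx]

rng = np.random.default_rng(0)
scores = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    # Resample INSIDE the fold, on the training rows only.
    X_res, y_res = random_oversample(X[tr], y[tr], rng)
    # Strong L2 regularization (small C) as a basic guard against
    # overfitting when features outnumber samples.
    model = LogisticRegression(C=0.1, max_iter=1000).fit(X_res, y_res)
    scores.append(average_precision_score(y[te], model.predict_proba(X[te])[:, 1]))

print(round(float(np.mean(scores)), 3))
```

Average precision (area under the PR curve) is used for scoring because, unlike accuracy, its chance level equals the minority prevalence, which makes it more informative on imbalanced data.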
- Does the difference in class ratios between the training and test sets affect performance on the held-out test set?
- How can I reduce overfitting and improve performance?
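On the first question, part of the precision drop is mechanical: even if a classifier's per-class error rates (TPR and FPR) are unchanged, its minority-class precision falls when minority prevalence falls, by Bayes' rule. A small worked example with a hypothetical operating point (60% recall, 20% false-positive rate; these numbers are illustrative, not from the question):

```python
def precision_at_prevalence(tpr, fpr, pi):
    # Precision = P(positive | predicted positive) for a classifier with
    # fixed TPR and FPR, as a function of minority prevalence pi:
    #   precision = TPR*pi / (TPR*pi + FPR*(1 - pi))
    return tpr * pi / (tpr * pi + fpr * (1 - pi))

# Same classifier, two prevalences: 20% (training) vs 12% (held-out test).
print(round(precision_at_prevalence(0.6, 0.2, 0.20), 3))  # -> 0.429
print(round(precision_at_prevalence(0.6, 0.2, 0.12), 3))  # -> 0.29
```

So a shift from 20% to 12% minority prevalence alone costs this classifier roughly 14 points of precision, before any distribution shift in the features is considered. Rank-based metrics such as ROC AUC are insensitive to prevalence, which is one reason to report them alongside precision/recall.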
SMOTETomek? – Dave May 11 '23 at 16:32

SMOTETomek. Final evaluation is done on the held-out test set using the best model from cross validation. – Dushi Fdz May 11 '23 at 17:18