I have an imbalanced dataset (n = 600, about 97% majority and 3% minority) with 20 features and a binary outcome. The data has been split into a training set and a test set (80%/20%). I used H2O AutoML to train a multilayer perceptron (MLP) on the training set and make predictions on the test set. To try to improve the predictive performance, the minority class in the training set was oversampled before training the MLP. The predictive performance on the test set did not improve, even after oversampling the minority class up to 40% of the training data. Any idea why the predictive performance is not improving even after balancing the sizes of the minority and majority classes?
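For concreteness, here is a minimal sketch of the kind of oversampling described above, using scikit-learn rather than H2O; the `make_classification` data is a synthetic stand-in for the real dataset, and all the numbers simply mirror the question (n = 600, ~3% minority, 80/20 split, minority upsampled to ~40% of the training set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic stand-in for the real data: n = 600, ~3% minority, 20 features.
X, y = make_classification(n_samples=600, n_features=20, weights=[0.97],
                           flip_y=0, random_state=0)

# 80/20 split, stratified so the tiny minority class appears in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# Oversample the minority class in the training set only (never the test set).
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
n_maj = int((y_tr == 0).sum())
target = int(n_maj * 0.4 / 0.6)          # minority count for a ~40% share
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=target - len(y_min), random_state=0)
X_bal = np.vstack([X_tr, X_up])
y_bal = np.concatenate([y_tr, y_up])
print(f"minority fraction after oversampling: {y_bal.mean():.2f}")
```

Note that the test set is left untouched, so with ~18 minority cases overall it still contains only a handful of minority examples, which is the crux of the comments below.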

  • Class imbalance is not inherently a problem, and oversampling won't solve a non-problem. (1) https://stats.stackexchange.com/questions/285231/what-problem-does-oversampling-undersampling-and-smote-solve?answertab=scoredesc#tab-top (2) https://stats.stackexchange.com/questions/283170/when-is-unbalanced-data-really-a-problem-in-machine-learning (3) https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he – Sycorax May 27 '22 at 16:17
  • What is "predictive performance" for you? – Ben Reiniger May 27 '22 at 16:20
  • The classification performance on the test set was evaluated using accuracy, AUC and recall – user145331 May 27 '22 at 16:23
  • The core problem is that you have a binary classification problem and 18 examples of one class. You need more data. Using an ornate model (neural networks) will be prone to dramatic overfitting (without adequate regularization), or the model is simply not informative (because the model is so regularized that its predictions are close to a global average). – Sycorax May 27 '22 at 16:25
  • Do you mean by "more data" a larger sample size? Shouldn't oversampling the minority group using SMOTE help with that? Would tuning the hyperparameters help? – user145331 May 27 '22 at 16:29
  • When the problem is that you have fewer than two dozen examples of a class, I've never found that over/under-sampling or SMOTE helps. In this setting, I'd be concerned about estimating a model more complex than a (penalized) logistic regression. – Sycorax May 27 '22 at 16:33
  • Evaluating the model may also be very unreliable with such a small dataset. The test set would have only about 4 minority class patterns (600 × 0.03 × 0.2 ≈ 3.6), so you will get a lot of variability in the performance statistic with different partitions of the data into training and test sets. It also means the AutoML algorithm is likely to run into problems with over-fitting the model selection criterion for much the same reason. – Dikran Marsupial Jun 02 '22 at 15:50
  • If getting more data is not an option, are there any suggestions on how to avoid/reduce performance variability and overfitting? – user145331 Jun 02 '22 at 20:01
  • @Statwonder I'd try bootstrap and bagging, but with so little data a neural network may be overkill; you may well find a linear classifier works just as well. However, for imbalanced datasets the key is to know what performance metric is the right one for your application (e.g. what are the false-positive and false-negative misclassification costs). An imbalance is not a good reason for resampling, but unequal misclassification costs may be. – Dikran Marsupial Jun 05 '22 at 13:24
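Pulling the comment thread together, here is a hedged sketch of the suggested direction: an L2-penalized logistic regression (per Sycorax) evaluated over many repeated 80/20 splits, which also makes the evaluation variability that Dikran Marsupial describes visible. The data is again a synthetic stand-in, and `C=0.1` is an illustrative penalty strength, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in: n = 600, ~3% minority, 20 features.
X, y = make_classification(n_samples=600, n_features=20, weights=[0.97],
                           flip_y=0, random_state=0)

# L2-penalized logistic regression: a deliberately simple, regularized model.
clf = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

# Repeat the 80/20 split many times: with only ~4 minority cases in each
# test set, the AUC from any single split is a very noisy estimate.
splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
aucs = []
for tr, te in splitter.split(X, y):
    clf.fit(X[tr], y[tr])
    aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))

print(f"AUC mean {np.mean(aucs):.2f}, sd {np.std(aucs):.2f}, "
      f"range [{min(aucs):.2f}, {max(aucs):.2f}]")
```

The spread of the 50 AUC values is the point: a wide range means that any improvement (or lack of it) seen on one particular test split may be indistinguishable from partitioning noise.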

0 Answers