
I'm trying to train a neural network model. Let us suppose that I have a dataset with 4 classes:

Class 1 - 500 samples

Class 2 - 2000 samples

Class 3 - 15000 samples

Class 4 - 60000 samples

In my first approach, I used downsampling to train my model: I selected 400 random samples from each class for training and 50 for validation. But I'm not sure how to test the model.

Should I use all the remaining samples of each class for testing? Or should I test in a balanced way, say, using only 50 samples per class?
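The per-class sampling described above can be sketched as follows. This is a minimal illustration: the label array and the index bookkeeping are assumptions for the example, not part of any specific framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels with the class counts from the question.
y = np.repeat([1, 2, 3, 4], [500, 2000, 15000, 60000])

train_idx, val_idx = [], []
for c in [1, 2, 3, 4]:
    # Shuffle the indices of this class, then take disjoint slices.
    idx = rng.permutation(np.flatnonzero(y == c))
    train_idx.extend(idx[:400])   # 400 per class for training
    val_idx.extend(idx[400:450])  # 50 per class for validation

print(len(train_idx), len(val_idx))  # 1600 200
```

The remaining indices of each class (everything past the first 450 in the shuffled order) are what is left over for testing.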


1 Answer

Dealing with imbalanced data requires care. Many algorithms assume roughly balanced classes during training, and an imbalanced evaluation set can easily lead you to misinterpret your results.

The most practical way to validate on an imbalanced dataset is to downsample it exactly as you did for training: in your example, randomly choose 50 samples of each class.

The other, less obvious approach is to weight each class when computing your evaluation metric. Your preferred framework likely has functions that handle this for you; for example, scikit-learn provides `balanced_accuracy_score`, which averages recall over the classes.
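Here is a small sketch of the difference on a toy imbalanced test set, assuming scikit-learn is installed. The labels and predictions are made up for illustration.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced test set: 90 majority-class samples, 10 minority.
y_true = [0] * 90 + [1] * 10
# The model gets every majority sample right but misses half the minority.
y_pred = [0] * 95 + [1] * 5

print(accuracy_score(y_true, y_pred))           # 0.95 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.75 -- mean per-class recall
```

Plain accuracy hides the minority-class errors because the majority class dominates the count; balanced accuracy weights each class equally.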

But again, be careful not to forget the classes are imbalanced when interpreting your results, and analyze precision and recall for each class individually.
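With scikit-learn, per-class precision and recall can be obtained by passing `average=None` to the metric functions. The toy labels below are illustrative assumptions, not real model output.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical predictions where the model over-predicts majority class 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 95 + [1] * 5

# average=None returns one score per class instead of a single aggregate.
per_class_precision = precision_score(y_true, y_pred, average=None)
per_class_recall = recall_score(y_true, y_pred, average=None)

print(per_class_precision)  # class 0 is near 0.95, class 1 is perfect
print(per_class_recall)     # class 0 is perfect, class 1 catches only half
```

Looking at the per-class recall immediately shows that half of the minority class is being missed, which an aggregate metric would obscure.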