
I'm trying to train a neural network model. Let us suppose that I have a dataset with 4 classes:

Class 1 - 500 samples

Class 2 - 2000 samples

Class 3 - 15000 samples

Class 4 - 60000 samples

In my first approach, I used downsampling to train my model: I selected 400 random samples from each class for training and 50 for validation. But I'm not sure how to test the model.

Should I use all the remaining samples of each class for testing? Or should I test in a balanced way, say, using only 50 samples per class?
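The per-class sampling described above can be sketched as follows. This is a minimal illustration: the label array and the index bookkeeping are assumptions for the example, not part of any specific framework.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical labels with the class counts from the question.
y = np.repeat([1, 2, 3, 4], [500, 2000, 15000, 60000])

train_idx, val_idx = [], []
for c in [1, 2, 3, 4]:
    # Shuffle the indices of this class, then take disjoint slices.
    idx = rng.permutation(np.flatnonzero(y == c))
    train_idx.extend(idx[:400])   # 400 per class for training
    val_idx.extend(idx[400:450])  # 50 per class for validation

print(len(train_idx), len(val_idx))  # 1600 200
```

The remaining indices of each class (everything past the first 450 in the shuffled order) are what is left over for testing.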


1 Answer

Dealing with imbalanced data requires care. Many algorithms assume roughly balanced classes during training, and an imbalanced evaluation set can easily lead you to misinterpret your results.

The most practical way to validate on an imbalanced dataset is to downsample it exactly as you did for training: in your example, randomly choose 50 samples of each class.

The other, less obvious approach is to weight each class when computing your evaluation metric. Your preferred framework likely has functions that handle this for you; for example, scikit-learn provides `balanced_accuracy_score`, which averages recall over the classes.
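Here is a small sketch of the difference on a toy imbalanced test set, assuming scikit-learn is installed. The labels and predictions are made up for illustration.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical imbalanced test set: 90 majority-class samples, 10 minority.
y_true = [0] * 90 + [1] * 10
# The model gets every majority sample right but misses half the minority.
y_pred = [0] * 95 + [1] * 5

print(accuracy_score(y_true, y_pred))           # 0.95 -- looks great
print(balanced_accuracy_score(y_true, y_pred))  # 0.75 -- mean per-class recall
```

Plain accuracy hides the minority-class errors because the majority class dominates the count; balanced accuracy weights each class equally.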

But again, be careful not to forget the classes are imbalanced when interpreting your results, and analyze precision and recall for each class individually.
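With scikit-learn, per-class precision and recall can be obtained by passing `average=None` to the metric functions. The toy labels below are illustrative assumptions, not real model output.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical predictions where the model over-predicts majority class 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 95 + [1] * 5

# average=None returns one score per class instead of a single aggregate.
per_class_precision = precision_score(y_true, y_pred, average=None)
per_class_recall = recall_score(y_true, y_pred, average=None)

print(per_class_precision)  # class 0 is near 0.95, class 1 is perfect
print(per_class_recall)     # class 0 is perfect, class 1 catches only half
```

Looking at the per-class recall immediately shows that half of the minority class is being missed, which an aggregate metric would obscure.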