
Assume that I have 10 classes with 100 samples per class: the same number of samples in each class, a perfectly balanced dataset.

I want to add 3 new classes. Which of the following is the best option for the number of samples for each newly added class?

  1. 100, 100, 100
  2. 200, 200, 200
  3. 1000, 1000, 1000
  4. 100, 1000, 1000

I am creating a data analysis product for my clients, and I have to set a threshold for the minimum (or maximum) number of samples they have to add.

It depends on the datasets we are using, but balanced data is almost always better than imbalanced.

However, I am not sure how to set a threshold for the number of samples in the newly added classes, assuming I can choose it.

mkt

4 Answers


All else being equal, more data is always better. So #3 is clearly the best option.

Imbalanced data is not really a problem, and sacrificing more data for balance is throwing away free information (as Stephan Kolassa notes, the cost of data collection could be a concern - I am ignoring that for now).

See the following questions for more detailed discussion about this common misconception:

Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?

When is unbalanced data really a problem in Machine Learning?

Does an unbalanced sample matter when doing logistic regression?

What is the root cause of the class imbalance problem?

This would be a more difficult choice if instead of [1000, 1000, 1000], option #3 was something like [10, 1000, 1000]. In that case, it is arguable whether you would learn enough about that one class from 10 samples to make the additional benefit of 1000 samples from the other 2 classes worth it - so [200, 200, 200] or [100, 1000, 1000] might be better options.

mkt
    +1. Of course, collecting data is always costly, so it's not like the additional data will come for free. It's just that this cost should be much more important in the decision to be made, rather than any concerns about imbalance. – Stephan Kolassa Jul 21 '22 at 10:06

More data per group is always better than less data, and it doesn't matter what the sample sizes of the other groups are.

The imbalance "problem" means that if you can collect only 1000 data points in total, it's usually better to have 500:500 than 100:900. But 100:900 will still be better than 100:100, simply because there is more information in the data, regardless of the balance. The marginal value of a data point is lower if you already have many data points from that class, but it is never negative.
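The point above can be sketched with a quick Monte Carlo simulation (a hypothetical setup; the class mean, sample sizes, and trial count are made up for illustration): how precisely you can estimate a quantity for a class depends only on that class's own sample size, not on how many samples the other classes have.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_estimation_error(n_class, n_trials=2000):
    """Average absolute error when estimating a class's mean from n_class samples."""
    errors = []
    for _ in range(n_trials):
        samples = rng.normal(loc=1.0, scale=1.0, size=n_class)
        errors.append(abs(samples.mean() - 1.0))
    return float(np.mean(errors))

# Going from 100 to 900 samples in one class shrinks that class's
# estimation error, whatever the other classes look like.
err_100 = mean_estimation_error(100)
err_900 = mean_estimation_error(900)
print(err_100, err_900)
```

The error scales roughly as $1/\sqrt{n}$, so the extra 800 points still help, even though each additional point is worth less than the ones before it.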

Some models and measures do have problems with unbalanced data, but that is a modeling and competence issue; others are completely fine. There are many threads about this already on this site, and you are still better off collecting more data rather than less.

rep_ho

You have a trade-off between wanting the data to be balanced and preferring more data. As you said, it always depends on the data. Moreover, some metrics are more robust towards imbalance than others. Without any further information, I would choose option 1 or 2.

You could also try to augment your dataset by oversampling, or use metrics that are more robust towards imbalance.
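As a minimal sketch of what oversampling can look like, here is random oversampling in plain Python: minority-class samples are duplicated at random until every class matches the largest one. The `random_oversample` helper and the toy dataset are made up for illustration; libraries such as imbalanced-learn offer more principled variants (e.g. SMOTE).

```python
import random

random.seed(0)

def random_oversample(dataset):
    """Duplicate samples from minority classes until all classes match the largest."""
    by_class = {}
    for x, y in dataset:
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        extra = [random.choice(xs) for _ in range(target - len(xs))]
        balanced.extend((x, y) for x in xs + extra)
    return balanced

# Toy imbalanced dataset: 100 samples of class "A", 30 of class "B".
data = [(i, "A") for i in range(100)] + [(i, "B") for i in range(30)]
balanced = random_oversample(data)
```

Note that duplication adds no new information; it only changes how existing points are weighted, which is why collecting more real data is preferable when possible.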

frank

Presumably there is some analysis or model training you intend to do with this data.

Depending on what that is, there may be an a priori way to know how many samples you need (power analysis).

Even if there isn't a closed-form solution, you could use simulation to get some idea of how much data is needed. You could generate some plausible-looking classes and see how your method performs with 50, 100, 200, etc. samples.
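A minimal sketch of that simulation idea, under made-up assumptions: two Gaussian classes with a chosen separation, classified by a simple nearest-centroid rule, with test accuracy tracked as the per-class training size grows. The distributions, separation, and sizes are hypothetical; in practice you would plug in your own method and plausible class shapes.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_accuracy(n_train, n_test=2000):
    """Test accuracy of a nearest-centroid rule on two 1-D Gaussian classes,
    trained with n_train samples per class."""
    # Hypothetical classes: N(0, 1) and N(1.5, 1).
    a = rng.normal(0.0, 1.0, size=n_train)
    b = rng.normal(1.5, 1.0, size=n_train)
    # Nearest-centroid decision boundary: midpoint of the estimated means.
    midpoint = (a.mean() + b.mean()) / 2.0
    # Evaluate on fresh test data, half per class.
    test_a = rng.normal(0.0, 1.0, size=n_test // 2)
    test_b = rng.normal(1.5, 1.0, size=n_test // 2)
    correct = np.sum(test_a < midpoint) + np.sum(test_b >= midpoint)
    return correct / n_test

for n in (50, 100, 200, 1000):
    print(n, round(simulate_accuracy(n), 3))
```

Plotting accuracy (or whatever metric your product cares about) against sample size shows where the curve flattens, which is a reasonable basis for the minimum-samples threshold you want to set.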