
I am currently working on a signal classification problem, and since I can retrieve as many signal samples as I want, I have always chosen to construct perfectly balanced datasets. But recently I began to think that maybe this wasn't such a great idea. Maybe by constructing these "ideal" situations I am unintentionally hurting my model's generalization? Maybe I should add some class-size bias (~10%, for example)?
What do you think? At first glance, I didn't find any specific information on this topic on the Internet.

  • I think for training it should not matter that much. Actually, if you can draw as many samples as you want, you are in a very luxurious situation. In many cases people downsample or upsample their respective classes, which you do not have to do.

    Where I think it will matter more is your test set. Your test set should mimic the class distribution of the domain in which you would (hypothetically) deploy your model. So think about where you would like your model to be used and what ratio of classes you would expect to occur there (see the sketch after these comments).

    – Janosch Dec 12 '23 at 09:51
  • Welcome to Cross Validated! Do you have some sense of the natural prevalence of each of your categories? For instance, if you had to predict the probabilities of a signal belonging to each of your classes, but you weren’t allowed to look at (listen to, etc.) the signal, what would you predict? – Dave Dec 12 '23 at 11:31
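
A minimal sketch of the test-set point, assuming a hypothetical `draw_sample(label)` signal source and made-up deployment prevalences: the test set is drawn so that its class proportions match the distribution expected at deployment, independent of how the training set was balanced.

```python
import random

def draw_sample(label):
    """Stand-in for the real signal source; returns a dummy 16-point 'signal'."""
    return [random.gauss(label, 1.0) for _ in range(16)]

def build_test_set(n_total, deployment_priors):
    """Draw a labelled test set whose class proportions follow deployment_priors."""
    data = []
    for label, p in deployment_priors.items():
        data.extend((draw_sample(label), label) for _ in range(round(n_total * p)))
    random.shuffle(data)
    return data

# Train on whatever balance is convenient, but evaluate on the expected deployment mix,
# e.g. class 1 occurring only 5% of the time:
test_set = build_test_set(20_000, {0: 0.95, 1: 0.05})
```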

1 Answer


If you "construct perfectly balanced datasets", then your selection is introducing bias. You may be able to model the differential influence of predictors on class membership probabilities, but you have already biased your estimate of the unconditional class distribution.

Better to take a representative sample, and use appropriate methods. "Class imbalance" is usually not a problem, especially if you can collect data at will: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
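
For illustration, here is a minimal sketch (not from the linked thread, and with made-up prevalence values) of the standard prior-correction step: if a model was fit on an artificially balanced set, its predicted class probabilities can be reweighted by the ratio of the deployment prevalences to the training prevalences and then renormalised.

```python
import numpy as np

def correct_priors(p_train, train_priors, deploy_priors):
    """Rescale class posteriors estimated under the training priors to the deployment priors.

    p_train: (n_samples, n_classes) predicted probabilities from a model
             fit on the (artificially balanced) training set.
    """
    w = np.asarray(deploy_priors) / np.asarray(train_priors)
    p = p_train * w                            # reweight each class posterior
    return p / p.sum(axis=1, keepdims=True)    # renormalise so rows sum to 1

# Model trained on a 50/50 set, but class 1 assumed to occur only 5% of the time in the wild:
p_balanced = np.array([[0.30, 0.70],
                       [0.80, 0.20]])
print(correct_priors(p_balanced, train_priors=[0.5, 0.5], deploy_priors=[0.95, 0.05]))
```

The adjustment requires the unconditional class distribution, which is exactly the quantity a forced 50/50 construction no longer estimates.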

Stephan Kolassa