I am currently working on a signal classification problem, and since I can retrieve as many signal samples as I want, I have always chosen to construct perfectly balanced datasets. But recently I began to think that maybe this wasn't such a great idea. Maybe by constructing these "ideal" situations I am unintentionally hurting my model's generalization? Should I add some class-size bias (say, ~10%)?
What do you think? At first glance, I couldn't find any specific information on this topic on the Internet.
Ikaryssik
1 Answer
If you "construct perfectly balanced datasets", then your selection is introducing bias. You may be able to model the differential influence of predictors on class membership probabilities, but you have already biased your estimate of the unconditional class distribution.
Better to take a representative sample and use appropriate methods. "Class imbalance" is usually not a problem, especially if you can collect data at will; see: Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?
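To make this concrete, here is a minimal sketch on toy 2-D Gaussian data (an assumed setup for illustration) showing how a model trained on an artificially balanced sample shifts its probability estimates away from the true class prevalence:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def draw(n_pos, n_neg):
    """Draw toy 2-D Gaussian features for a positive and a negative class."""
    X = np.vstack([rng.normal(1.0, 1.0, (n_pos, 2)),
                   rng.normal(-1.0, 1.0, (n_neg, 2))])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    return X, y

# Assume the deployment domain has 10% positives.
X_test, y_test = draw(1_000, 9_000)

# Representative training sample vs. artificially balanced training sample.
X_rep, y_rep = draw(1_000, 9_000)
X_bal, y_bal = draw(5_000, 5_000)

for name, (X, y) in {"representative": (X_rep, y_rep),
                     "balanced": (X_bal, y_bal)}.items():
    clf = LogisticRegression().fit(X, y)
    mean_p = clf.predict_proba(X_test)[:, 1].mean()
    print(f"{name}: mean predicted P(positive) = {mean_p:.3f} (true rate 0.100)")
```

The representative model's average predicted probability tracks the true 10% rate, while the balanced model's is inflated, since its intercept encodes a 50/50 prior that the deployment domain does not have.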
Stephan Kolassa
Where I think it will matter more is your test set. Your test set should mimic the distribution of the domain in which you would (hypothetically) deploy your model. So think about where your model will be used and what class ratio you would expect there.
– Janosch Dec 12 '23 at 09:51
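As a rough sketch of this suggestion (the `make_test_set` helper and the 10% deployment rate are assumptions for illustration), you can subsample an abundant pool so the held-out test set matches the class ratio you expect in production:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_test_set(X_pos, X_neg, n_test, deploy_pos_rate=0.10):
    """Subsample a test set whose class ratio mimics the deployment domain."""
    n_pos = int(round(n_test * deploy_pos_rate))
    n_neg = n_test - n_pos
    pos_idx = rng.choice(len(X_pos), size=n_pos, replace=False)
    neg_idx = rng.choice(len(X_neg), size=n_neg, replace=False)
    X = np.vstack([X_pos[pos_idx], X_neg[neg_idx]])
    y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])
    perm = rng.permutation(n_test)  # shuffle so the classes are interleaved
    return X[perm], y[perm]

# Usage: draw from large pools of retrievable signal samples.
pool_pos = rng.normal(1.0, 1.0, (50_000, 2))
pool_neg = rng.normal(-1.0, 1.0, (50_000, 2))
X_test, y_test = make_test_set(pool_pos, pool_neg, n_test=10_000)
print(y_test.mean())  # ~0.10, the assumed deployment ratio
```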