I have seen a lot of conflicting advice on how to deal with class imbalance, and I understand that the right approach can be case-specific. In school I learned that oversampling (e.g. SMOTE) or undersampling were basically the ways to fix it, but in practice these methods seem to introduce problems of their own: biased predictions and probabilities that are hard to interpret. I have also looked into weighting the classes in the model, or leaving the model alone and just changing the decision threshold downstream. Every time I think I have a good solution to my classification problem (right now I am trying to build a model for classifying interested sales leads, where sales are 12x less frequent than non-sales), I do more research and run into a blog post insisting that it is the wrong way. Is there a right and a wrong way? I have a large dataset, and when I do nothing the accuracy is great, but of course the model predicts almost all non-sales.
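To make the "great accuracy" problem concrete, here is a minimal sketch in plain Python. All numbers are made up for illustration: the counts only mirror the 12:1 ratio described above, and the five scores in the second half are hypothetical predicted probabilities, not output from any real model.

```python
# Hypothetical counts: 12 non-sales for every sale, as described above.
n_sales = 1_000
n_non_sales = 12 * n_sales

# Labels: 1 = sale, 0 = non-sale.
y_true = [1] * n_sales + [0] * n_non_sales

# A "model" that always predicts the majority class (non-sale).
y_pred_majority = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (sales) that were predicted as positive."""
    true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return true_positives / sum(y_true)


majority_accuracy = accuracy(y_true, y_pred_majority)
majority_recall = recall(y_true, y_pred_majority)
print(f"always non-sale: accuracy={majority_accuracy:.3f}, recall={majority_recall:.3f}")

# Changing the threshold downstream: the scores stay the same, only the
# cutoff used to turn them into labels moves. These scores are made up.
toy_labels = [0, 0, 1, 1, 1]
toy_scores = [0.05, 0.10, 0.30, 0.45, 0.60]


def classify(scores, threshold):
    """Label a case as a sale when its score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


recall_at_default = recall(toy_labels, classify(toy_scores, 0.5))
recall_at_lowered = recall(toy_labels, classify(toy_scores, 0.25))
print(f"recall at 0.5: {recall_at_default:.3f}, at 0.25: {recall_at_lowered:.3f}")
```

The first part shows why accuracy alone is misleading here: a model that never predicts a sale is right about 12 out of every 13 cases while finding none of the sales. The second part only illustrates what moving the threshold means mechanically; which cutoff is appropriate would depend on the relative costs of missed sales and wasted follow-ups.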
Does this answer your question? "Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?" If not, what questions remain? You may be interested in some of the links in my profile, several of which relate to this topic. – Dave Aug 09 '23 at 22:22