I’m working with a big, imbalanced data set for a binary classification problem. Big in the sense that it’s hard to process all at once, and imbalanced in the sense that for every positive example there are about 1,000 negative examples.
My approach:
- reduce the size by down-sampling the negative examples so that the ratio of positive to negative examples is roughly even (a fully balanced data set)
- do the classic ML pipeline (EDA, feature extraction and modeling)
- "translate" results/performance metrics back to reflect the original non-sampled / imbalanced data.
This seemed like a reasonable approach to me, since it yields a manageable data set while preserving the signal that separates positive from negative examples. A rough sketch of what I have in mind is below.
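A minimal sketch of the idea, using scikit-learn on synthetic data and a logistic regression as a stand-in model (the dataset, model, and `beta` correction are illustrative assumptions, not a fixed recipe). The "translation" step here is the standard prior-correction formula for undersampled negatives:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic stand-in for the real data: roughly 1 positive per 1,000 negatives.
X, y = make_classification(n_samples=300_000, n_features=20,
                           weights=[0.999, 0.001], random_state=42)

# Hold out an untouched, still-imbalanced test set for honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Down-sample negatives in the training set to a ~1:1 ratio.
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
beta = len(neg_keep) / len(neg_idx)           # fraction of negatives kept
bal_idx = np.concatenate([pos_idx, neg_keep])

model = LogisticRegression(max_iter=1000)
model.fit(X_train[bal_idx], y_train[bal_idx])

# "Translate" scores back: if negatives were kept with probability beta,
# the corrected probability is  p = beta * p_s / (beta * p_s + 1 - p_s).
p_s = model.predict_proba(X_test)[:, 1]
p_corrected = beta * p_s / (beta * p_s + 1 - p_s)

# Ranking metrics are unchanged by this monotone correction, but
# probability-based metrics and decision thresholds are not.
print("ROC AUC:          ", roc_auc_score(y_test, p_corrected))
print("Average precision:", average_precision_score(y_test, p_corrected))
```

The key points in the sketch are that the test set is left at the original imbalance, and the predicted probabilities are re-calibrated for the shifted class prior before reporting metrics.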
My question: is this a sensible, methodologically sound approach? What are the pitfalls? When is it a good or bad idea?