I’m working with a big, imbalanced data set for a binary classification problem. Big in the sense that it’s hard to process all at once, and imbalanced in the sense that for every positive example there are about 1,000 negative examples.
My approach:
- reduce the size by down-sampling the negative examples so that the ratio of positive to negative examples is roughly even (a fully balanced data set)
- do the classic ML pipeline (EDA, feature extraction and modeling)
- "translate" results/performance metrics back to reflect the original non-sampled / imbalanced data.
This seemed like a reasonable approach to me, since it yields a manageable data set while preserving the signal that separates positive from negative examples. A rough sketch of what I have in mind is below.
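A minimal sketch of the idea, using scikit-learn on synthetic data and a logistic regression as a stand-in model (the dataset, model, and `beta` correction are illustrative assumptions, not a fixed recipe). The "translation" step here is the standard prior-correction formula for undersampled negatives:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic stand-in for the real data: roughly 1 positive per 1,000 negatives.
X, y = make_classification(n_samples=300_000, n_features=20,
                           weights=[0.999, 0.001], random_state=42)

# Hold out an untouched, still-imbalanced test set for honest evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Down-sample negatives in the training set to a ~1:1 ratio.
pos_idx = np.where(y_train == 1)[0]
neg_idx = np.where(y_train == 0)[0]
neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
beta = len(neg_keep) / len(neg_idx)           # fraction of negatives kept
bal_idx = np.concatenate([pos_idx, neg_keep])

model = LogisticRegression(max_iter=1000)
model.fit(X_train[bal_idx], y_train[bal_idx])

# "Translate" scores back: if negatives were kept with probability beta,
# the corrected probability is  p = beta * p_s / (beta * p_s + 1 - p_s).
p_s = model.predict_proba(X_test)[:, 1]
p_corrected = beta * p_s / (beta * p_s + 1 - p_s)

# Ranking metrics are unchanged by this monotone correction, but
# probability-based metrics and decision thresholds are not.
print("ROC AUC:          ", roc_auc_score(y_test, p_corrected))
print("Average precision:", average_precision_score(y_test, p_corrected))
```

The key points in the sketch are that the test set is left at the original imbalance, and the predicted probabilities are re-calibrated for the shifted class prior before reporting metrics.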
My question: is this a sensible, methodologically sound approach? What are the pitfalls? When is it a good or bad idea?