
I am tasked with understanding what's causing a rare manufacturing failure (ca. 1 in 5,000) to worsen recently (to ca. 1 in 3,000). I have a very large database, so I can draw sufficient samples from each class if I balance them with some sampling algorithm (random undersampling, ENN, etc.); but if I pull a raw, unbalanced set, the enormous dataset quickly becomes computationally unmanageable.

The question: if I can collect reasonably representative sets containing hundreds or thousands of samples from each class (pass and fail), and my goal is only to offer insight into which features are driving the failure, can I build a classification model on the balanced dataset? I know this model will not accurately predict individual failures in the raw, unbalanced data, but I don't really need to do that now, and I don't have the resources on hand to model such large quantities of data. Will the insights I get from the balanced data be fundamentally (and misleadingly) different from what I'd get on a raw, unbalanced set, or will they simply be cruder, blurrier versions? I don't want the former, obviously, but I can deal with the latter.

Alternatively, am I thinking of this all wrong? Should I be using a totally different approach than classification ML models to handle these rare events?
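One way to sanity-check the undersampling-plus-classification plan is on synthetic data. For logistic regression specifically, there is a standard result (case-control sampling, the King–Zeng "prior correction"): balancing the classes biases only the intercept, by the log of the sampling ratio, while the slope coefficients, i.e. the "which features drive the failure" part, remain consistent. A minimal sketch, with all data, rates, and feature names invented for illustration:

```python
# Hedged sketch: does undersampling the majority ("pass") class distort
# feature insights from a logistic regression? Synthetic data with one
# known failure-driving feature; every number here is illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(size=(n, 3))
# True model: only feature 0 drives failure; intercept makes it rare.
logit = -6.0 + 1.5 * X[:, 0]
y = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

# Random undersampling: keep all fails, draw an equal number of passes.
fail_idx = np.flatnonzero(y)
pass_idx = rng.choice(np.flatnonzero(~y), size=fail_idx.size, replace=False)
idx = np.concatenate([fail_idx, pass_idx])
clf = LogisticRegression().fit(X[idx], y[idx])

# Slopes survive the balancing: feature 0 should dominate the other two.
print(clf.coef_[0])

# Only the intercept absorbs the sampling ratio; adding back the log of
# the population fail:pass odds recovers roughly the true -6.0.
n_pass = np.flatnonzero(~y).size
corrected_intercept = clf.intercept_[0] + np.log(fail_idx.size / n_pass)
print(corrected_intercept)
```

So at least for this model class, balanced training gives the "cruder/blurrier" outcome rather than the "misleading" one, as far as coefficient-based insight goes; for tree ensembles and other nonlinear learners the feature rankings are likewise driven by class differences, but there is no equally clean correction for the predicted probabilities.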

Ferdi
  • If the task is to simply classify the data you already have, you can use any common classification tool. In several toolkits you can use the same data to try out different algorithms and compare them side by side. The training/learning phase is always done on balanced data, since otherwise you'd put additional weight on the larger fraction. Validation is also done on a balanced set. Whether it works on a true set is mostly about the quality of the data and the tool (in that order!). There should be a lot of tutorials that have classification examples. – cherub Jun 05 '18 at 13:41
  • @cherub, thanks. I have tried a few classification methods, and they do suggest interesting relationships; I just worried that I couldn't trust those relationships b/c they arose from artificial balancing of the data. If I understood you correctly, it is not necessarily true that a model built on such balanced data is bad, but it can be, depending on the quality of the data and the tool. Is there any way I can check the quality of the data and tool to see whether I can trust them, or is fitting on an unbalanced test set the only real way? – Helenus the Seer Jun 05 '18 at 14:00
  • @Helenus: Since your ratio is about one to a thousand, just imagine a constant "classifier" which always yields "not broken". If you determine the error of this classification, you get about "one in a thousand". It's going to be hard to beat that. Depending on the classification method, it "needs" enough examples from each category to "learn" the characteristics or characteristic differences. Regarding the quality of the data: this opens up a completely new (albeit related) field of statistical data analysis, e.g. feature extraction, input preparation, etc. – cherub Jun 05 '18 at 14:12
  • Preliminary question followed by comments: Are you sure this is a classification problem? It sounds more like a probability-prediction problem or (somewhat related) a hazard-estimation problem. If it is a probability-prediction problem, you don't have to worry as much about class imbalance as you do about sample size. Moreover, if you use some popular method for balancing your class sizes, you will build a model that doesn't estimate the probabilities of interest, because it is trained on data that looks nothing like reality. See Frank Harrell's many posts about this. – Brash Equilibrium Jun 05 '18 at 22:33
  • @Brash Equilibrium: Thanks, I have heard of this issue before, and agree that a 1/5000 failure sounds more probabilistic than mechanistic. However, other than choosing logistic regression or other 'classification' tools that can provide class probabilities rather than labels, and avoiding simplistic scoring like overall accuracy, I am unsure how to follow Frank's guidance in practice. Can you recommend a book or other resource explaining what other techniques can be used by someone wanting to follow this approach? I will read those and hopefully understand the distinction better. – Helenus the Seer Jun 05 '18 at 23:54
  • 2
    Does this answer your question? Down sampling the big class in imbalanced data My answer there specifically addresses the issue of having so many majority-class observations per minority-class observation that your hardware cannot handle the massive amount of data needed to get a decent number of minority-class observations. I think that’s exactly the issue in this question. – Dave Sep 09 '23 at 03:43
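The constant-classifier and proper-scoring points raised in the comments above can be made concrete in a few lines. With a roughly 1-in-3,000 failure rate, always predicting "pass" already scores about 99.97% accuracy, so raw accuracy tells you nothing; a proper scoring rule on predicted probabilities (log loss here) does penalize a forecast that ignores the failures. All numbers below are illustrative, not from the actual database:

```python
# Sketch: why overall accuracy is useless at a ~1-in-3000 failure rate,
# and why scoring predicted probabilities is not. Synthetic labels only.
import numpy as np
from sklearn.metrics import accuracy_score, log_loss

rng = np.random.default_rng(1)
y = rng.random(300_000) < 1.0 / 3000     # true labels at the stated rate

always_pass = np.zeros_like(y)           # constant "not broken" classifier
print(accuracy_score(y, always_pass))    # ~0.9997, yet it finds no failure

# A probability forecast is judged by a proper scoring rule instead:
p_base = np.full(y.size, 1.0 / 3000)     # honest base-rate forecast
p_zero = np.full(y.size, 1e-9)           # overconfident "never fails"
print(log_loss(y, p_base))
print(log_loss(y, p_zero))               # worse: the rare fails cost dearly
```

This is one practical reading of the "probability prediction, not classification" advice: keep models that output probabilities, evaluate them with log loss or the Brier score on data at the true class ratio, and skip accuracy-style thresholded metrics.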

0 Answers