I've been tasked with understanding what's causing a rare manufacturing failure (about 1 in 5000) to worsen recently (to about 1 in 3000). I have a very large database, so I can get sufficient samples from each class if I balance them with some sampling algorithm (random undersampling, edited nearest neighbours (ENN), etc.). However, if I pull a raw/unbalanced set, the dataset quickly becomes computationally unmanageable.
The question: if I can collect reasonably representative sets containing hundreds or thousands of samples from each class (pass and fail), and my goal is only to offer insight into which features are driving the failure, can I build a classification model on the balanced dataset? I know this model will not accurately predict individual failures in the raw/unbalanced data, but I don't really need to do that now, and I don't have the resources on hand to model such large quantities of data. Will the insights I get from the balanced data be fundamentally (and misleadingly) different from what I'd get on a raw unbalanced set, or will they simply be cruder/blurrier versions of the same thing? Obviously I don't want the former, but I can live with the latter.
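For concreteness, here is a minimal sketch of the pipeline I have in mind (synthetic data stands in for my real tables, and the random undersampler and random forest are just placeholder choices):

```python
# Sketch of the workflow I'm considering: undersample to balance the classes,
# fit a classifier, and read off feature importances for insight (not prediction).
# Synthetic data stands in for my real pass/fail records.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.under_sampling import RandomUnderSampler  # or EditedNearestNeighbours

# Simulate a rare-failure problem: failure rate on the order of 1 in 3000
X, y = make_classification(
    n_samples=300_000,
    n_features=20,
    n_informative=5,
    weights=[0.99967],  # remaining ~0.033% of samples are failures
    random_state=0,
)

# Balance by undersampling the majority (pass) class
sampler = RandomUnderSampler(random_state=0)
X_bal, y_bal = sampler.fit_resample(X, y)

# Fit a classifier on the balanced set purely to rank feature importance
clf = RandomForestClassifier(n_estimators=500, random_state=0)
clf.fit(X_bal, y_bal)

# Rank features by importance; this ranking, not the predictions, is the deliverable
ranking = sorted(enumerate(clf.feature_importances_), key=lambda t: -t[1])
for idx, imp in ranking[:10]:
    print(f"feature_{idx}: {imp:.3f}")
```

The point is that the importance ranking, not the predicted probabilities, is what I'd report back.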
Alternatively, am I thinking about this all wrong? Should I be using a completely different approach from classification ML models to handle these rare events?