My aim is to predict quarterly customer-default probabilities. I have data on ~2 million individuals, who default with an average probability of ~0.3 percent.
I am therefore considering undersampling the majority class (non-defaults) to save computing time, since kernel methods can be quite costly; I know how to correct the probability predictions for the sampling afterwards. The other option would be simply taking a random sub-sample of the full data.
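For concreteness, here is a sketch of the undersampling-plus-correction scheme I have in mind (using scikit-learn and a small synthetic data set as a stand-in for my real data; the correction formula assumes all defaults are kept and non-defaults are sampled at rate beta):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: 200k rows with a rare positive class
# (the real data would be ~2M rows at ~0.3% defaults).
n = 200_000
X = rng.normal(size=(n, 5))
logits = -6.5 + X[:, 0] + 0.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

# Keep every default, and a fraction beta of the non-defaults.
beta = 0.05
pos = np.flatnonzero(y == 1)
neg = np.flatnonzero(y == 0)
keep_neg = rng.choice(neg, size=int(beta * neg.size), replace=False)
idx = np.concatenate([pos, keep_neg])

clf = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# Probabilities from the undersampled model are biased upward; correct
# them for the sampling rate: sample odds are true odds / beta, so
# p = beta * p_s / (beta * p_s - p_s + 1).
p_s = clf.predict_proba(X)[:, 1]
p = beta * p_s / (beta * p_s - p_s + 1.0)
```

(For logistic-type models the same correction can equivalently be applied by adding log(beta) to the intercept.)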
What do you think would be a good ratio of defaults to non-defaults in my learning sample?
Thanks for your help!