
My aim is to predict quarterly customer-default probabilities. I have data on ~2 million individuals, who default with an average probability of ~0.3 percent.

I am therefore thinking about undersampling the majority class (non-defaults) to save computing time (kernel methods can be quite costly; I know about correcting the probability predictions afterwards) -- the other option being to just take a sub-sample of the whole data set.
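For concreteness, the kind of correction I have in mind is the usual prior correction after undersampling. A minimal sketch on synthetic data (scikit-learn; the feature model, the numbers and the 1:1 ratio are just placeholders, not my actual setup):

```python
# Undersample the non-defaults, fit, then map the predicted probabilities
# back to the original default rate.  Everything here is illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: a default rate well under 1 %.
n = 200_000
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-(X[:, 0] - 6.5)))).astype(int)

pos = np.where(y == 1)[0]
neg = np.where(y == 0)[0]

# Keep all defaults and a random 1:1 subset of non-defaults.
beta = len(pos) / len(neg)            # keep-rate of the majority class
keep_neg = rng.choice(neg, size=len(pos), replace=False)
idx = np.concatenate([pos, keep_neg])

model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])

# Probabilities on the undersampled scale ...
p_s = model.predict_proba(X)[:, 1]
# ... corrected back to the original prevalence:
#   p = beta * p_s / (beta * p_s + 1 - p_s)
p = beta * p_s / (beta * p_s + 1 - p_s)

print("raw mean predicted rate:      ", p_s.mean())
print("corrected mean predicted rate:", p.mean())
print("true default rate:            ", y.mean())
```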

What do you think would be a good ratio of defaults to non-defaults in my learning sample?

Thanks for your help!

RichardN

1 Answer


As a first approximation, 1:1 is a good proportion, but:

  • Some methods are more vulnerable to unequal classes than others -- a plain decision tree will almost always vote for the much larger class, while 1-NN is not affected at all. It is a good idea to check this (in the literature or by asking here) in the context of your problem.
  • You should beware of possible inhomogeneities (say, hidden "subclasses") -- subsampling may change the proportions between such subclasses and thus lead to strange effects. To this end, it is good to try at least a few subsampling realizations, to have a chance to spot possible problems (well, I called it a problem, but it may also be a quite enlightening insight into the data). A sketch of this check follows the list.
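A rough sketch of that stability check, assuming scikit-learn and hypothetical array names (X_train, y_train, X_valid); the spread of held-out scores is just one possible stability measure, and you can pass a different estimator to probe the sensitivity point from the first bullet:

```python
# Refit the same model on several random 1:1 undersamples and compare
# the held-out scores across draws; names and measure are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def undersample_fit(X, y, rng, estimator=None):
    """Fit a model on one random 1:1 undersample of the majority class."""
    pos = np.where(y == 1)[0]
    neg = rng.choice(np.where(y == 0)[0], size=len(pos), replace=False)
    idx = np.concatenate([pos, neg])
    model = estimator if estimator is not None else LogisticRegression(max_iter=1000)
    return model.fit(X[idx], y[idx])

def subsampling_stability(X, y, X_holdout, n_draws=5, seed=0):
    """Mean across-draw standard deviation of held-out scores."""
    rng = np.random.default_rng(seed)
    preds = np.column_stack([
        undersample_fit(X, y, rng).predict_proba(X_holdout)[:, 1]
        for _ in range(n_draws)
    ])
    return preds.std(axis=1).mean()

# Hypothetical usage, assuming X_train, y_train, X_valid already exist:
# spread = subsampling_stability(X_train, y_train, X_valid)
# A large spread across draws hints at inhomogeneities in the majority class.
```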
I will use logistic-regression-related methods such as Generalized Additive Models and Kernel Logistic Regression, which, given the large sample size, are hardly affected by unequal classes. – RichardN Feb 15 '11 at 09:39