If I understand correctly, you have a classification problem where you observe $(X_1,Y_1),\dots,(X_n,Y_n)$ ($Y_i=1$ corresponds to default and $Y_i=0$ to no default), and you have many more observations with $Y_i=0$ than with $Y_i=1$. You want to take a subsample $S$ of the $(X_i,Y_i)$ that have $Y_i=0$.
I am not sure I understand all the points in your question, but I will address the effect of subsampling and the question of assigning equal or unequal importance to the classes in classification.
Forecasting the probability of default
For predictive purposes, if you use the Bayes principle, you might construct, for $i=0,1$, an estimate $\hat{f}_i(x)$ of the density $f_i$ of $(X\mid Y=i)$, and estimate the probability that $Y=i$ by $\hat{\pi}_i$. This lets you estimate the probability of default given $X=x$:
$$\hat{P}(x)=\frac{\hat{\pi}_1\hat{f}_1(x)}{\hat{\pi}_1\hat{f}_1(x)+\hat{\pi}_0\hat{f}_0(x)}$$
If your subsample is a proper random subsample and if you know the true proportions, you only need to adjust $\hat{\pi}_0$ (resp. $\hat{\pi}_{1}$), the probability of no default (resp. of default). The only consequence of subsampling is then that $\hat{f}_0$ will not be estimated as well as it could be. This affects the probability $\hat{P}$ the most, in relative terms, when $\hat{\pi}_0\hat{f}_0(x)\gg\hat{\pi}_1\hat{f}_1(x)$, that is, when you are almost certain that there is no default.
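As an illustration, here is a minimal sketch (not code from the question) of this plug-in estimate using kernel density estimates, assuming scikit-learn is available; `X1`, `X0_sub` and `pi1` are hypothetical placeholders for the default observations, the subsample of no-default observations, and the known true default proportion.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def estimate_default_probability(X1, X0_sub, pi1, x_new, bandwidth=0.5):
    """Plug-in estimate of P(default | X = x) for each row of the 2-D array x_new."""
    pi0 = 1.0 - pi1                                           # true proportion of no default
    f1_hat = KernelDensity(bandwidth=bandwidth).fit(X1)       # \hat{f}_1 from the defaults
    f0_hat = KernelDensity(bandwidth=bandwidth).fit(X0_sub)   # \hat{f}_0 from the subsample only
    num = pi1 * np.exp(f1_hat.score_samples(x_new))           # \hat{\pi}_1 \hat{f}_1(x)
    den0 = pi0 * np.exp(f0_hat.score_samples(x_new))          # \hat{\pi}_0 \hat{f}_0(x)
    return num / (num + den0)
```

Note that the priors fed into the formula are the true (pre-subsampling) proportions, not the proportions observed in the subsampled training set; that is the only adjustment needed.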
Forecasting default
If you just want a classification rule (i.e. to forecast default or no default), then you can set a threshold on the preceding probability, which need not be $1/2$. If you think a false default alarm costs more than a missed default, you will take a threshold larger than $1/2$; if it is the contrary, you will use a threshold lower than $1/2$. In this case the effect of subsampling can be deduced from its effect on the probability forecast.
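For concreteness, a small sketch of this thresholding step, using the standard cost-ratio threshold $c_{\text{false alarm}}/(c_{\text{false alarm}}+c_{\text{missed}})$ (a common choice, not something stated in the question); the cost arguments are hypothetical.

```python
import numpy as np

def classify_default(p_default, cost_false_alarm, cost_missed_default):
    """Turn probability forecasts into 0/1 decisions (1 = predict default)."""
    # A costlier false alarm pushes the threshold above 1/2, as in the text.
    threshold = cost_false_alarm / (cost_false_alarm + cost_missed_default)
    return (np.asarray(p_default) > threshold).astype(int)

# Example: classify_default([0.3, 0.8], cost_false_alarm=3.0, cost_missed_default=1.0)
# uses threshold 0.75, so only the second observation is flagged as a default.
```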
Hypothesis testing approach
I think it is clear that there are cases where you do not want these two errors to be treated with equal importance, and setting the threshold as in the preceding paragraph may be subjective. I find it interesting to use a hypothesis testing approach (i.e. not Bayesian) in which you want to construct a decision rule $\psi(x)$ that minimizes
$$ \alpha P_1(\psi(X)= 0) + (1-\alpha) P_0(\psi(X)=1) \;\; [1]$$
where $\alpha$ is a parameter you choose to set the relative importance of the two errors. For example, if you do not want to give more importance to one error than to the other, you can take $\alpha=1/2$. The rule that minimizes [1] above is given by
$$ \psi^*(x)= \left \{ \begin{array}{cl} 0 & \text{if } \dfrac{(1-\alpha) f_0(x)}{(1-\alpha) f_0(x)+\alpha f_1(x)}>1/2 \\ 1 & \text{otherwise} \end{array} \right .$$
If you replace $f_i$ by $\hat{f}_i$ for $i=0,1$, you recover the Bayesian classification rule, but with the weight $\hat{\pi}_1$ (the frequency of default) replaced by the real importance $\alpha$ you give to a default (and $\hat{\pi}_0$ by $1-\alpha$).
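A minimal sketch of the rule $\psi^*$, assuming `f0_hat` and `f1_hat` are density estimates you already have (e.g. the kernel estimates from the first sketch, wrapped to return densities); the function names are hypothetical.

```python
import numpy as np

def psi_star(x_new, f0_hat, f1_hat, alpha=0.5):
    """Return 0 (no default) or 1 (default) for each row of x_new."""
    w0 = (1.0 - alpha) * f0_hat(x_new)   # (1 - alpha) * f_0(x)
    w1 = alpha * f1_hat(x_new)           # alpha * f_1(x)
    # psi*(x) = 0 exactly when (1 - alpha) f_0(x) > alpha f_1(x); 1 otherwise.
    return (w1 >= w0).astype(int)
```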