I am doing a small project to predict the write-off probability of our defaulted customers.
In the original population, the write-off rate is about 0.515. For reasons outside my control, I had to undersample the population and build a new data set in which the write-off rate is down to 0.15. Since undersampling changes the original event probability of the population, I need to calibrate the model predictions so that they represent the true probability that a write-off will happen.
Based on the new (undersampled) data, I built a logistic regression model using glm in R, and then followed the approach discussed in the post "converting predicted probabilities after downsampling to actual probabilities in classification" to calibrate the output. I checked the probabilities generated by the LR model, and their mean is 0.15.
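For reference, here is a minimal sketch of what I did (the data frame `train_under`, the outcome `writeoff`, and the predictors `x1`, `x2` are placeholders, not my actual variable names):

```r
alpha       <- 0.515  # write-off rate in the original population
alpha_prime <- 0.15   # write-off rate after undersampling

# Logistic regression on the undersampled data (placeholder formula)
fit <- glm(writeoff ~ x1 + x2, data = train_under, family = binomial)

# Raw model probabilities; their mean is about 0.15 on the undersampled data
p_s <- predict(fit, type = "response")

# Calibration back to the original event rate, as in the referenced post
p_cal <- 1 / (1 + ((1 / alpha - 1) / (1 / alpha_prime - 1)) * (1 / p_s - 1))

mean(p_s)    # ~ 0.15
mean(p_cal)  # this came out around 0.64, not 0.515
```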
From my understanding, I should see the average of the calibrated probabilities come out very close to 0.515. However, the mean of the calibrated probabilities is 0.64, which is quite different from the original event probability.
My questions are:
- Is my understanding correct? i.e., should the mean of the calibrated probabilities be close to 0.515?
- The calculation in the referenced post is $$ p = \frac{1}{1+\frac{\left(\frac{1}{\alpha}-1\right)}{\left(\frac{1}{\alpha'}-1\right)} \cdot \left(\frac{1}{p_s}-1\right)}, $$ where $\alpha$ denotes the original event rate, $\alpha'$ the (re/over/under)sampled rate, $p_s$ the model's output "probability", and $p$ the calibrated probability. Is it provable that the mean of $p$ should be equal to, or at least approximately equal to, $\alpha$? (A small simulation sketch for checking this empirically follows after this list.)
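In case it helps, here is a hypothetical simulation one could run to check the question empirically. Everything here (sample size, data-generating process, coefficients) is made up for illustration: it generates a population with an event rate of roughly 0.51, undersamples the events down to about 0.15, refits the logistic regression, applies the same formula, and prints the mean of the calibrated probabilities next to $\alpha$ for comparison.

```r
set.seed(1)
n  <- 100000
x  <- rnorm(n)
pr <- plogis(0.06 + x)        # true event probabilities; population rate ~0.51
y  <- rbinom(n, 1, pr)

alpha <- mean(y)              # realised original event rate

# Undersample the events (y == 1) so the new event rate is about 0.15
idx1 <- which(y == 1)
idx0 <- which(y == 0)
n1   <- round(0.15 / 0.85 * length(idx0))
idx  <- c(sample(idx1, n1), idx0)
alpha_prime <- mean(y[idx])

# Refit on the undersampled data and calibrate with the formula above
fit <- glm(y[idx] ~ x[idx], family = binomial)
p_s <- fitted(fit)
p   <- 1 / (1 + ((1 / alpha - 1) / (1 / alpha_prime - 1)) * (1 / p_s - 1))

c(alpha = alpha, mean_calibrated = mean(p))
```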
Thanks for your help.