Predicted probability from logistic regression higher than actual distribution

Question

I am running logistic models on data where the proportions for the dependent variable typically look like the following:

      0    1
Grp1  0.18 0.82
Grp2  0.24 0.76

Now I run a logistic regression with independent variables Grp (0 if Grp1 and 1 if Grp2) and control variables A, B and C (including a number of interactions) that gives me an estimate of 0.5 for Grp. I then calculate the predicted values for both groups where A, B and C are taken at their mean values, which gives me a probability of 0.82 for Grp2 and 0.88 for Grp1.

I am not sure I understand why it is possible for both groups to have a predicted probability above the actual proportions. Assuming that the calculations are correct (admittedly an ambitious assumption, although I have tried to verify with different ways of calculating the predicted values), is it theoretically possible to have higher predictions for both groups?

Perhaps this Q&A http://stats.stackexchange.com/questions/25389/obtaining-predicted-values-y-1-or-0-from-a-logistic-regression-model-fit?rq=1 will help you. — mdewey, Feb 23 '17 at 12:06
Thanks for the link, but that question seems more concerned with turning predicted probabilities into integers. I'm more interested in whether it is theoretically possible to have predicted probabilities that are higher than the conditional proportions calculated before running the model... — avriis, Mar 03 '17 at 10:32

score 2 · Answer 1 · answered Nov 21 '22 at 02:33

Yes, this is possible.

A logistic regression fitted by maximum likelihood is calibrated in the aggregate, meaning that the sum of the predicted probabilities equals the sum of the outcomes (assuming the outcome is coded 0/1 and the model contains an intercept). This is explained in the Q&A "why in logistic regression the probability mass equal the count". In formulas, $$ \sum_i y_i = \sum_i \hat{p}_i $$ which follows from the residuals $y_i - \hat{p}_i$ being orthogonal to the intercept column. Likewise, because the residuals are orthogonal to the group indicator, we can show $$ \sum_{i \in \text{Group 1}} y_i = \sum_{i \in \text{Group 1}} \hat{p}_i $$ and the same for group 2.
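
As a brief sketch of where that orthogonality comes from, assume the model is fitted by ordinary (unpenalised) maximum likelihood, and write $x_{ij}$ for the value of design-matrix column $j$ for observation $i$ and $\ell$ for the log-likelihood (this notation is introduced here and is not in the question). The score equations that the fitted coefficients satisfy are $$ \frac{\partial \ell}{\partial \beta_j} = \sum_i x_{ij}\,(y_i - \hat{p}_i) = 0 \quad \text{for every column } j. $$ Taking the intercept column ($x_{ij} = 1$ for all $i$) gives the identity for the whole sample. Taking the Grp column (1 for Grp2, 0 for Grp1) gives $$ \sum_{i \in \text{Group 2}} (y_i - \hat{p}_i) = 0, $$ and subtracting this from the whole-sample identity gives the same result for group 1.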

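So if you make an individual prediction for each $i$ and average those within each group, you will get what you expect. But that is not what you did: you made a single prediction for each group, using the means of the covariates. For that there is no guarantee, because the logistic function is nonlinear, so the prediction at the mean covariate values is in general not the mean of the individual predictions. Especially with skewed covariate distributions (I guess), you can be surprised, as you were.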
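
To make the difference concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration, not the data or model from the question: a single skewed covariate `a` (exponentially distributed), made-up coefficients in `true_beta`, no interactions, and a small hand-written Newton-Raphson fitter `fit_logistic` standing in for whatever software was actually used. It compares, for each group, the observed proportion, the average of the individual predictions, and the single prediction at the overall covariate mean.

```python
import numpy as np


def sigmoid(eta):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-eta))


def fit_logistic(X, y, n_iter=50, tol=1e-10):
    """Unpenalised maximum-likelihood fit via Newton-Raphson (illustrative helper)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                         # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated data (all values below are assumptions for the illustration).
rng = np.random.default_rng(0)
n = 5000
grp = rng.integers(0, 2, size=n)        # 0 = Grp1, 1 = Grp2
a = rng.exponential(scale=1.0, size=n)  # skewed control variable
X = np.column_stack([np.ones(n), grp, a])
true_beta = np.array([0.5, -0.4, 1.5])  # hypothetical coefficients
y = rng.binomial(1, sigmoid(X @ true_beta))

beta_hat = fit_logistic(X, y)
p_hat = sigmoid(X @ beta_hat)

results = {}
for g in (0, 1):
    mask = grp == g
    observed = y[mask].mean()        # actual proportion of 1s in the group
    avg_pred = p_hat[mask].mean()    # mean of the individual predictions
    # One prediction per group, with the covariate held at its overall mean.
    x_at_mean = np.array([1.0, g, a.mean()])
    pred_at_mean = sigmoid(x_at_mean @ beta_hat)
    results[g] = (observed, avg_pred, pred_at_mean)
    print(f"Grp{g + 1}: observed={observed:.3f}  "
          f"mean of predictions={avg_pred:.3f}  "
          f"prediction at mean={pred_at_mean:.3f}")
```

In this setup the mean of the individual predictions reproduces the observed proportion in each group, while the prediction at the covariate mean comes out higher in both groups. The direction of the gap depends on where the linear predictor sits on the logistic curve: here it is mostly positive, where the curve is concave, so evaluating at the mean overshoots.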
