
I calculated the difference in two proportions with prop.test(), using some example data I found on the internet.

> (ppt <- prop.test(x = c(11, 8), n = c(16, 21), correct = FALSE))

        2-sample test for equality of proportions without continuity correction

data:  c(11, 8) out of c(16, 21)
X-squared = 3.4159, df = 1, p-value = 0.06457
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.001220547  0.614315785
sample estimates:
   prop 1    prop 2 
0.6875000 0.3809524 

The p-value is 0.06457. The confidence interval for the difference in proportions is -0.001220547 to 0.614315785.

If I calculate the p-value from the CI manually using the normal distribution:

> z_score <- qnorm((1 + 0.95) / 2)          # 1.959964, the 0.975 normal quantile
> se <- diff(ppt$conf.int) / (2 * z_score)  # half the CI width divided by the z quantile
> z <- -diff(ppt$estimate) / se             # difference in proportions over that SE
> 2 * (1 - pnorm(abs(z)))
[1] 0.05091551

this p-value does not agree with the p-value reported by prop.test(), even though it is obtained from the confidence interval provided by the very same function.

When I calculate it from a logistic regression using marginal effects, I get the same confidence interval that prop.test() gives, and the p-value I calculated above:

data <- data.frame(Status = c(rep(TRUE, 11), rep(FALSE, 16 - 11),
                              rep(TRUE, 8),  rep(FALSE, 21 - 8)),
                   Group  = c(rep("Gr1", 16), rep("Gr2", 21)))

> m <- glm(Status ~ Group, family = binomial(), data = data)
> margins::margins_summary(m)

   factor     AME     SE       z      p   lower  upper
 GroupGr2 -0.3065 0.1570 -1.9522 0.0509 -0.6143 0.0012
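For what it's worth, here is a base-R sketch (my own cross-check, not how the margins package computes things internally) of why these numbers come out this way: with Group as the only predictor, the AME is simply the difference in the fitted group probabilities, and its delta-method SE coincides with the unpooled Wald SE.

# fitted probabilities are just the two sample proportions here
p <- predict(m, newdata = data.frame(Group = c("Gr1", "Gr2")), type = "response")
unname(diff(p))                     # -0.3065476, the reported AME
sqrt(sum(p * (1 - p) / c(16, 21)))  # 0.1570, the unpooled Wald SE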

But when I run an ANOVA on this model with the Rao score test, I get the p-value from prop.test():

> anova(m, test="Rao")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Status

Terms added sequentially (first to last)

      Df Deviance Resid. Df Resid. Dev    Rao Pr(>Chi)  
NULL                     36     51.266                  
Group  1   3.4809        35     47.785 3.4159  0.06457 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
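A quick cross-check, reusing the objects created above, confirms that the Rao statistic in this table is exactly the X-squared statistic that prop.test() reports:

anova(m, test = "Rao")$Rao[2]  # 3.4159, the Rao score statistic for Group
unname(ppt$statistic)          # 3.4159, prop.test()'s X-squared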

So how is it possible that prop.test() gives me a p-value which does NOT come from the confidence interval provided by the same function?

It looks like the confidence interval is Wald's, but the p-value comes from the Rao score test. What is the sense behind this? I was taught that the confidence interval should match the p-value. Here it does not, as they are calculated with two different methods. In this scenario the two agree on statistical significance, but what if they don't?


EDIT: My question is about the possible statistical reasons for using different methods to calculate the p-value and the confidence interval within a single procedure. I guess there are considerations that statisticians agree on here, so I'm curious how this discrepancy is justified.


EDIT 2: The linked answer (P value and confidence interval for two sample test of proportions disagree) does not agree with my findings. It's clear that the confidence interval I got is NOT the Wilson CI, it's the Wald CI, so the linked answer is not correct. Again, my question is: what is the reason behind reporting a Wald confidence interval alongside a Rao score test?

The proposed answer, which led to closing my post, is wrong for two proportions, as is evident from my findings. This can also be seen here: https://stats.stackexchange.com/a/570528/382831

Another proof:

> PropCIs::wald2ci(11, 16, 8, 21, conf.level = 0.95, adjust = "Wald")

data:  

95 percent confidence interval:
 -0.001220547  0.614315785
sample estimates:
[1] 0.3065476
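The same limits can also be reproduced in base R alone, confirming that this is the plain (unpooled) Wald interval:

# plain Wald interval for the difference of two proportions
p1 <- 11/16; p2 <- 8/21
se_wald <- sqrt(p1 * (1 - p1) / 16 + p2 * (1 - p2) / 21)
(p1 - p2) + c(-1, 1) * qnorm(0.975) * se_wald
# [1] -0.001220547  0.614315785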


EDIT 3: It's NOT about using different distributions (it's the z statistic for proportions, NOT the t statistic, because of the mean-variance dependency $\sigma^2 = p(1-p)$!), as some people suggested in the comments!

Moreover, I didn't make any mistake in the formula:

se <- diff(ppt$conf.int) / (2 * z_score)

where the 2 is NOT 1.96 or anything else rounded, as some suggested when claiming I used an "approximation". It is 1/2 of the CI width, the "half-CI" or "precision", divided by the z score: precision/z_score = (CI_length/2) / z_score = CI_length/(2*z_score).
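To verify that no rounding is involved anywhere, the back-solved SE can be compared digit for digit with the unpooled Wald SE computed from scratch:

se_backsolved <- diff(ppt$conf.int) / (2 * qnorm(0.975))
se_wald <- sqrt((11/16) * (5/16) / 16 + (8/21) * (13/21) / 21)
c(se_backsolved, se_wald)  # both 0.1570274 -- identical, no approximation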

  • You are obliged to pick one method of computing the p-value. If you do not, then it behooves you to use the most extreme value that is bad for you, lest you be accused (with good reason) of p-hacking. – whuber Mar 10 '23 at 04:34
  • I'm sorry, I don't understand how your response relates to my question. My question is what the possible reason is for reporting an inconsistent p-value and confidence interval. I have never seen anything like that. The CI should reflect the p-value. Personally I have no idea which test I should use here and why, but this is not my question (I could choose any); rather, I ask what the possible reason was to report the Rao test but Wald's confidence interval. The p-value does not agree with the calculation obtained from the CI. I assume there was some statistical reason for this and I'm curious what. – kmonk_gun Mar 10 '23 at 04:42
  • Hi: This won't answer your question but a possible suggestion is to check the gory details of these various calls you make. For example, the margins package must have the margins_summary function in it. So maybe download the source and take a look. Currently, I'm trying to follow some of the details you gave above. I just have one question: in your calculation of the p-value using the normal distribution, where does diff(ppt$conf.int) / (2 * z_score) come from? Is that some well known formula written in a not well known way? Thanks. – mlofton Mar 10 '23 at 05:25
  • It's just the width of the CI divided by twice the z score. It's the classic formula. – kmonk_gun Mar 10 '23 at 21:54
  • By formula you mean rule of thumb. 2 is an approximation. – socialscientist Mar 11 '23 at 01:37
  • You're assuming the errors are drawn from entirely different distributions in each case. As others have noted, you can get any results then. Recommend reading up on the differences and approximations of z score, t score, etc. – socialscientist Mar 11 '23 at 01:41
  • I'm sorry, @socialscientist, but your answer doesn't match my question. Kindly re-read my question rather than sending me to the 101 class on t vs. z, because it's NOT about different distributions and approximations. Once again, I'm asking why prop.test() returns DIFFERENT inferential methods for THE SAME question: Wald's CI vs. the Rao test rather than Wald's test. The two outcomes are inconsistent within THE SAME statistical routine in R, and this is the source of my question. I explained it in detail; I don't know what else I should write to get an answer why the Rao test and not Wald. – kmonk_gun Mar 11 '23 at 15:16
  • @socialscientist By no means is it about any "approximation"! This is to get the half width of the confidence interval; it's 1/2, one half, 0.5, not "1.96 rounded to 2" in the denominator! It's a classic 101 formula. – kmonk_gun Mar 11 '23 at 15:21
  • Related: Why do the bootstrap calculated p-value and the confidence intervals seem to contradict each other? and Confidence interval / p-value duality: don't they use different distributions?. Possibly the same happens in your case. You would have to dig up the actual tests that are underlying the R functions that you have been using. – Sextus Empiricus Mar 14 '23 at 12:20
  • There are situations where the two will just differ, e.g. for a hypothesis test it's okay to work with a standard error that is only valid under the null hypothesis, while for a confidence interval you would not want to do that. I guess that's probably not the exact problem in this case. Why would you ever assume a normal approximation would be okay for a difference in proportions? By definition that cannot possibly have a normal sampling distribution (except asymptotically, as the sample size goes to $\infty$, and if you are in the interior of the parameter space). – Björn Mar 14 '23 at 12:57

1 Answer


The difference is in how the standard error is computed. It can be based on a pooled variance estimate under the assumption of equal probabilities (which makes sense if the null hypothesis is true, and is what you use when you compute the p-value), or on a non-pooled variance estimate.

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{ \left(\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2} \right)}} = \frac{\frac{11}{16} - \frac{8}{21}}{\sqrt{ \left(\frac{(11/16)\cdot(5/16)}{16} + \frac{(8/21)\cdot(13/21)}{21} \right)}} = 1.952191$$

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p}) \left(\frac{1}{n_1} + \frac{1}{n_2} \right)}} = \frac{\frac{11}{16} - \frac{8}{21}}{\sqrt{\frac{19 \cdot 18}{37^2} \left(\frac{1}{16} + \frac{1}{21} \right)}} = 1.848227$$

These correspond, respectively, to p-values of 0.05092 and 0.06457:

> # unpooled (Wald) standard error
> z = (11/16 - 8/21) / sqrt((11/16)*(5/16) * 1/16 + (8/21)*(13/21) * 1/21)
> 2 * (1 - pnorm(z))
[1] 0.05091551
> 
> # pooled (score) standard error
> z = (11/16 - 8/21) / sqrt((19/37)*(18/37) * (1/16 + 1/21))
> 2 * (1 - pnorm(z))
[1] 0.06456946
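Squaring the pooled z statistic recovers the X-squared value that prop.test() prints, which confirms that its p-value is the score test:

> z^2   # z still holds the pooled statistic from above
[1] 3.415944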

Also note the references in the prop.test manual:

Newcombe R.G. (1998). Two-Sided Confidence Intervals for the Single Proportion: Comparison of Seven Methods. Statistics in Medicine, 17, 857–872. doi: 10.1002/(SICI)1097-0258(19980430)17:8<857::AID-SIM777>3.0.CO;2-E.

Newcombe R.G. (1998). Interval Estimation for the Difference Between Independent Proportions: Comparison of Eleven Methods. Statistics in Medicine, 17, 873–890. doi: 10.1002/(SICI)1097-0258(19980430)17:8<873::AID-SIM779>3.0.CO;2-I.

There are a lot of different approaches possible that may give a different result.
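As an illustration of one such alternative, here is a rough sketch of Newcombe's hybrid score method (often cited as his method 10, from the second reference above), built from the per-group Wilson intervals that prop.test() returns for a single proportion. This is my own reconstruction, not a claim about what any particular package implements:

# Wilson score interval for a single proportion
wilson <- function(x, n) prop.test(x, n, correct = FALSE)$conf.int
ci1 <- wilson(11, 16)
ci2 <- wilson(8, 21)
d <- 11/16 - 8/21
# combine the one-sample score limits into an interval for the difference
c(d - sqrt((11/16 - ci1[1])^2 + (ci2[2] - 8/21)^2),
  d + sqrt((ci1[2] - 11/16)^2 + (8/21 - ci2[1])^2))
# yields limits noticeably different from the Wald interval above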

  • Thank you so much! Now I understand the difference (though I'm still wondering why the authors used one approach to calculate the p-value and another to calculate the CI, which should be consistent within a single method... I'd love to know their reasons...) – kmonk_gun Mar 14 '23 at 13:33
  • @kmonk_gun It is for convenience. The approximation with a z-score assumes that the distribution of the observed statistic is normally distributed. But the standard error of that distribution depends, aside from the sample sizes (which are fixed and known), on the actual $p_1$ and $p_2$. In constructing the confidence interval one would have to adjust the standard error for every value of the $p$'s in the computation. Instead it is assumed to be fixed. But which fixed values do you choose? For the confidence interval it makes sense to use the observed $\hat{p}_1$ and $\hat{p}_2$..... – Sextus Empiricus Mar 14 '23 at 13:42
  • ....however for the test of the null hypothesis, which assumes $p_1 = p_2$, it makes sense to use $\hat{p}$, which is the estimate for the groups taken together. If you were to use this $p_1 = p_2$ for the confidence interval, then the standard error would not be computed very well when the confidence interval is far away from $p_1 = p_2$.... – Sextus Empiricus Mar 14 '23 at 13:43
  • ... The Wilson score interval for a binomial proportion is an example where the variation of the standard error of the z statistic is taken into account when computing the confidence interval. – Sextus Empiricus Mar 14 '23 at 13:46
  • I wonder whether it is worth including your two comments about using two values of p or one value of p in the answer. That is the justification I always gave to people who queried the different formula. – mdewey Mar 14 '23 at 14:15
  • @mdewey I have no time for this now, but if I remember I have a nice idea how to include it along with a graphic/image (something close to https://en.m.wikipedia.org/wiki/File:Wilson_score_pdf_and_interval_equality.png). – Sextus Empiricus Mar 14 '23 at 14:23