Consistency between binomial confidence intervals, Fisher test and power?

Question

The question is at the end of the reasoning.

1: Representation of binomial confidence intervals:

We want to know whether the proportion of reds in zone 1 is significantly different from the proportion of reds in zone 2.

df <- matrix(c(58, 51, 8, 17), nrow = 2,
                       dimnames =
                         list(c("zone1", "zone2"),
                              c("Red", "Blue")))
df
prop.table(df, margin=1)
df2<- data.frame(prop=c(0.8787879, 0.7500000),
                 zone=c(1,2))
df2
plot(prop~zone, data=df2, ylim=c(0,1))
segments(x0 = 1, y0 = binom.test(58, 66)$conf.int[1],
         x1 = 1, y1 = binom.test(58, 66)$conf.int[2], col = "red",
         lty = 1, lwd = 3)
segments(x0 = 2, y0 = binom.test(51, 68)$conf.int[1],
         x1 = 2, y1 = binom.test(51, 68)$conf.int[2], col = "red",
         lty = 1, lwd = 3)
abline(h=binom.test(58, 66)$conf.int[1], lty=2)
abline(h=binom.test(51, 68)$conf.int[2], lty=2)

Clearly, the two groups' confidence intervals overlap. I would therefore expect a test to be far from statistical significance.

2: Some stats:

Calculation of the statistical power of the Fisher test:

library('statmod')
(power.fisher.test(p1=0.88, p2=0.75, n1=66, n2=68, alpha=0.08, nsim=1000, alternative="two.sided"))

Power of 0.53, given the 13 percentage point difference and the sample size in each group.

Now the Fisher test:

fisher.test(df)

p = 0.07, which is close to the 0.05 significance level.

--> My question: Why does the Fisher test give a fairly low p-value (0.07), when the binomial confidence intervals overlap substantially and the power calculation for the Fisher test in this context gives a low value (so we should not get such a low p-value)?

Neither estimate is in the other group's 95% CI, so that indicates some evidence of a difference, but it's not strong evidence, since the CIs overlap. A p-value of 0.07 is consistent with some evidence of a difference, but not strong evidence. It happens by chance about 1 in 15 times when there is no difference at all. — user2554330, Jul 30 '22 at 11:46
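The two statements in the comment above (each estimate lies outside the other group's interval, yet the intervals overlap) can be checked directly. This is a minimal sketch using the counts from the question; the variable names are mine, not part of the original post.

```r
# Exact (Clopper-Pearson) 95% intervals, as returned by binom.test
ci1 <- binom.test(58, 66)$conf.int   # zone 1
ci2 <- binom.test(51, 68)$conf.int   # zone 2

# Point estimates of the proportion of reds
est1 <- 58 / 66
est2 <- 51 / 68

# Is each estimate inside the OTHER group's interval?
est1_in_ci2 <- est1 >= ci2[1] && est1 <= ci2[2]
est2_in_ci1 <- est2 >= ci1[1] && est2 <= ci1[2]

# Do the two intervals share any values at all?
intervals_overlap <- ci1[1] <= ci2[2] && ci2[1] <= ci1[2]

print(c(est1_in_ci2 = est1_in_ci2,
        est2_in_ci1 = est2_in_ci1,
        intervals_overlap = intervals_overlap))
```

Overlapping intervals with each estimate outside the other's interval is the pattern the comment describes as some, but not strong, evidence of a difference.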
Also, bear in mind that Fisher's Exact Test (which is exact only in situations that rarely apply in practice) and the binomial test actually test different hypotheses. The hypotheses are closely related, but different nonetheless. Their p-values (and associated CIs) will be similar but need not be identical. Depending on the orientation of the table, the binomial test is actually closer to a Cochran-Mantel-Haenszel test for the equality of row (column) means. — Limey, Jul 30 '22 at 11:59
Thank you both! That clarifies things. But I am still surprised to see such a discrepancy between the binomial test and the Fisher test. I know their CIs/p-values would not be exactly the same, but I would expect them to be closer than what is observed. The binomial 95% CIs overlap a lot in this example. The power of the Fisher test in this example is 0.53, which is very low, so I am surprised to get a p of 0.07 with such heavily overlapping binomial CIs… It does not sound very reliable; nonetheless, it is indicative of a “trend”, I guess... — SkyR, Jul 30 '22 at 12:36
This power calculation is called a posteriori power. Never, never do that. The a posteriori power is a decreasing function of the p-value, so computing it makes no sense. — Stéphane Laurent, Jul 30 '22 at 13:00
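To illustrate the point about a posteriori power, here is a rough sketch. It is not the Fisher-test calculation from the question: it swaps in base R's `prop.test` and `power.prop.test` (normal approximation) so that the result is deterministic, and the group size of 100 and the reference count of 75 are made-up values for illustration only.

```r
# Hypothetical setup: 100 observations per group, reference group fixed at 75/100
n  <- 100
x2 <- 75
x1 <- 76:95   # a range of possible observed counts in the other group

# p-value of the two-sample test of proportions for each possible observed table
p_value <- sapply(x1, function(x) {
  prop.test(c(x, x2), c(n, n), correct = FALSE)$p.value
})

# "A posteriori" power: plug the observed proportions back in as if they were the truth
post_hoc_power <- sapply(x1, function(x) {
  power.prop.test(n = n, p1 = x / n, p2 = x2 / n, sig.level = 0.05)$power
})

result <- data.frame(x1, p_value, post_hoc_power)
print(round(result, 3))
```

Reading down the table, the p-value falls as the a posteriori power rises: once the sample sizes are fixed, one is just a re-expression of the other, so a modest power figure computed from the observed proportions does not contradict a modest p-value.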
Moreover, checking the overlap of the individual intervals is the wrong approach. The chi-square test gives a p-value of 0.09, consistent with the Fisher test. chisq.test(as.table(rbind(c(58, 8), c(51, 17)))). — Stéphane Laurent, Jul 30 '22 at 13:06
Thank you. Do you mean “a posteriori power” should never be calculated? Why is checking the overlap of 95% confidence intervals the wrong approach? — SkyR, Jul 30 '22 at 16:52
Does this answer your question? Relation between confidence interval and testing statistical hypothesis for t-test Although that question is in the context of t-tests, the principle holds in all testing: lack of overlap of 95% confidence intervals is much more stringent than a p <0.05 test for a difference. — EdM, Jul 31 '22 at 15:36
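One way to see why the overlap rule is more stringent: the standard error of a difference is sqrt(se1^2 + se2^2), which is always smaller than se1 + se2, the margin that non-overlapping intervals effectively demand. The sketch below compares the two views on the counts from the question. It uses `prop.test` for the interval of the difference (a normal approximation with continuity correction, not the exact method behind `fisher.test`), and the variable names are mine.

```r
counts <- c(zone1 = 58, zone2 = 51)   # reds
totals <- c(zone1 = 66, zone2 = 68)   # reds + blues

# View 1: two separate 95% intervals, compared by eye for overlap
ci1 <- binom.test(counts[["zone1"]], totals[["zone1"]])$conf.int
ci2 <- binom.test(counts[["zone2"]], totals[["zone2"]])$conf.int
overlap <- ci1[1] <= ci2[2] && ci2[1] <= ci1[2]

# View 2: one 95% interval for the difference in proportions
diff_ci <- prop.test(counts, totals)$conf.int

# The test from the question, for comparison
p_fisher <- fisher.test(matrix(c(58, 51, 8, 17), nrow = 2))$p.value

print(overlap)
print(diff_ci)
print(p_fisher)
```

The question to ask of the data is whether the interval for the difference excludes zero, not whether the two separate intervals are disjoint; the former is what lines up with the test's p-value.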
Thank you! Yes, partly, and I have run some examples. The fisher.test method and the binomial CI overlap method are similar, and both can serve as a basis for judging the significance of an effect, but they are not exactly the same, especially when the effect is small or when the evidence is only indicative rather than strong. fisher.test is more sensitive (for 0.01<p<0.05) than the CI method, but below p≈0.01 they agree. The CI method is therefore conservative and may give more false negatives, while fisher.test may give fewer false negatives but possibly more false positives. — SkyR, Aug 01 '22 at 12:23
Additionally, it turns out that the CI method was able to detect the significance of an effect only when the fisher power was above 0.8 (for another example run for which n=100, p1=0.95), whereas the fisher.test pvalue was able to detect a mild effect at 0.01<p<0.05 when its own power was ranging from 0.5 to 0.8; I would call it the "discrepancy zone" or the "grey zone" — SkyR, Aug 01 '22 at 13:47