I am puzzled that I can detect a statistically significant effect that is smaller than the minimum detectable effect (MDE) from a power calculation with the same parameters.
Here's Stata code for the familiar two-sample proportions test:
. power twoproportions 0.146, n1(254426) n2(255237) alpha(.05) power(0.8) test(chi2) effect(diff)

Performing iteration ...

Estimated experimental-group proportion for a two-sample proportions test
Pearson's chi-squared test
H0: p2 = p1  versus  Ha: p2 != p1; p2 > p1

Study parameters:

        alpha =    0.0500
        power =    0.8000
            N =  509,663
           N1 =  254,426
           N2 =  255,237
        N2/N1 =   1.0032
           p1 =    0.1460

Estimated effect size and experimental-group proportion:

        delta =    0.0028  (difference)
           p2 =    0.1488
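
As a sanity check on where that delta comes from, the textbook normal-approximation formula MDE = (z_{1-alpha/2} + z_{power}) * SE reproduces it closely. This is just my back-of-the-envelope version with both groups held at p1, not the exact chi-squared calculation that power runs:

. local za = invnormal(1 - .05/2)                       // 1.96 for two-sided alpha = .05
. local zb = invnormal(.8)                              // 0.84 for 80% power
. local se = sqrt(.146*(1-.146)*(1/254426 + 1/255237))  // SE of the difference, both groups at p1
. display "approximate MDE = " (`za' + `zb')*`se'       // ~ .0028, matching delta above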
. prtesti 255237 .1488 254426 0.146, level(95) // use MDE from above

Two-sample test of proportions                 x: Number of obs =      255237
                                               y: Number of obs =      254426
------------------------------------------------------------------------------
             |       Mean   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |      .1488   .0007044                      .1474193    .1501807
           y |       .146      .0007                      .1446279    .1473721
-------------+----------------------------------------------------------------
        diff |      .0028   .0009931                      .0008535    .0047465
             |  under H0:   .0009931     2.82   0.005
------------------------------------------------------------------------------
        diff = prop(x) - prop(y)                                  z =  2.8193
    H0: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9976         Pr(|Z| > |z|) = 0.0048          Pr(Z > z) = 0.0024
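
(The z statistic checks out by hand from the pooled proportion, so nothing unusual is happening inside prtesti:)

. local p   = (255237*.1488 + 254426*.146)/(255237 + 254426) // pooled proportion under H0
. local se0 = sqrt(`p'*(1-`p')*(1/255237 + 1/254426))        // std. err. of diff under H0
. display "z = " .0028/`se0'                                 // ~ 2.8193, as reported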
So far that's as expected. But here's the hypothesis test with a smaller effect:
. prtesti 255237 0.148 254426 0.146, level(95)

Two-sample test of proportions                 x: Number of obs =      255237
                                               y: Number of obs =      254426
------------------------------------------------------------------------------
             |       Mean   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |       .148   .0007029                      .1466224    .1493776
           y |       .146      .0007                      .1446279    .1473721
-------------+----------------------------------------------------------------
        diff |       .002    .000992                      .0000557    .0039443
             |  under H0:    .000992     2.02   0.044
------------------------------------------------------------------------------
        diff = prop(x) - prop(y)                                  z =  2.0161
    H0: diff = 0

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(Z < z) = 0.9781         Pr(|Z| > |z|) = 0.0438          Pr(Z > z) = 0.0219
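
If I have the algebra right, the bar for significance is only about 1.96 standard errors, which is well below the 0.0028 MDE. A rough calculation of the smallest observed difference that would reach p < .05:

. local p   = (255237*.148 + 254426*.146)/(255237 + 254426)    // pooled proportion for this test
. local se0 = sqrt(`p'*(1-`p')*(1/255237 + 1/254426))          // std. err. of diff under H0
. display "smallest significant diff = " invnormal(.975)*`se0' // ~ .0019, so .002 clears it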
Here I reject the null even though the difference of 0.002 is smaller than the 0.0028 MDE.
I am also seeing this with online MDE calculators.
Any idea why the MDE is so conservative here? Or is this not an apples-to-apples comparison, since power falls as the effect shrinks:
. power twoproportions 0.146 0.148, n1(254426) n2(255237) alpha(.05) test(chi2) effect(diff)

Estimated power for a two-sample proportions test
Pearson's chi-squared test
H0: p2 = p1  versus  Ha: p2 != p1

Study parameters:

        alpha =    0.0500
            N =  509,663
           N1 =  254,426
           N2 =  255,237
        N2/N1 =   1.0032
        delta =    0.0020  (difference)
           p1 =    0.1460
           p2 =    0.1480

Estimated power:

        power =    0.5224
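
That 0.5224 also falls out of the normal approximation: power is roughly the chance that z lands beyond 1.96 when the true difference is 0.002. A sketch that ignores the small pooled-versus-unpooled SE distinction:

. local se = sqrt(.146*(1-.146)/254426 + .148*(1-.148)/255237)  // SE at the assumed true proportions
. local za = invnormal(.975)
. display "approximate power = " normal(.002/`se' - `za') + normal(-.002/`se' - `za') // ~ .5224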
I guess the lesson is that power isn't an absolute property of a test; it's defined relative to the size of the effect you want to detect. The MDE is the smallest true effect the test would flag as significant with 80% probability, not the smallest difference that can ever reach significance, so the smaller 0.002 effect can still turn up significant, just with worse power (about 52% instead of 80%).
