
I am puzzled why I am able to detect a statistically significant effect that is smaller than the MDE from the power calculation for the same set of parameters.

Here's Stata code for the familiar two-sample proportions test:

. power twoproportions 0.146, n1(254426) n2(255237) alpha(.05) power(0.8) test(chi2) effect(diff)

Performing iteration ...

Estimated experimental-group proportion for a two-sample proportions test
Pearson's chi-squared test
H0: p2 = p1  versus  Ha: p2 != p1; p2 > p1

Study parameters:

    alpha =    0.0500
    power =    0.8000
        N =   509,663
       N1 =   254,426
       N2 =   255,237
    N2/N1 =    1.0032
       p1 =    0.1460

Estimated effect size and experimental-group proportion:

    delta =    0.0028  (difference)
       p2 =    0.1488

. prtesti 255237 .1488 254426 0.146 , level(95) // use MDE from above

Two-sample test of proportions                      x: Number of obs =   255237
                                                    y: Number of obs =   254426


------------------------------------------------------------------------------
             |       Mean   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |      .1488   .0007044                      .1474193    .1501807
           y |       .146      .0007                      .1446279    .1473721
-------------+----------------------------------------------------------------
        diff |      .0028   .0009931                      .0008535    .0047465
             |  under H0:   .0009931     2.82   0.005
------------------------------------------------------------------------------


    diff = prop(x) - prop(y)                                  z =   2.8193
H0: diff = 0

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

 Pr(Z < z) = 0.9976         Pr(|Z| > |z|) = 0.0048          Pr(Z > z) = 0.0024

So far that's as expected. But here's the hypothesis test with a smaller effect:

. prtesti 255237 0.148 254426 0.146 , level(95)

Two-sample test of proportions                      x: Number of obs =   255237
                                                    y: Number of obs =   254426


------------------------------------------------------------------------------
             |       Mean   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
           x |       .148   .0007029                      .1466224    .1493776
           y |       .146      .0007                      .1446279    .1473721
-------------+----------------------------------------------------------------
        diff |       .002    .000992                      .0000557    .0039443
             |  under H0:    .000992     2.02   0.044
------------------------------------------------------------------------------


    diff = prop(x) - prop(y)                                  z =   2.0161
H0: diff = 0

Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0

 Pr(Z < z) = 0.9781         Pr(|Z| > |z|) = 0.0438          Pr(Z > z) = 0.0219

Here I reject the null even though p2 = 0.148 is less than the 0.1488 implied by the MDE (equivalently, the difference of 0.002 is below the MDE of 0.0028).
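Just to check where that 2.02 comes from, here's a back-of-the-envelope reproduction of the z-statistic using the pooled standard error under H0 (a rough sketch of what I understand prtesti is doing, not its exact internals):

    * Back-of-the-envelope z for the second test (pooled SE under H0)
    local nx = 255237
    local px = 0.148
    local ny = 254426
    local py = 0.146
    local ppool = (`px'*`nx' + `py'*`ny') / (`nx' + `ny')
    local se0 = sqrt(`ppool'*(1 - `ppool')*(1/`nx' + 1/`ny'))
    display "z = " (`px' - `py')/`se0'    // roughly 2.02, beyond the 1.96 cutoff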

I am also seeing this with online MDE calculators.

Any idea why the MDE is so conservative here? Or is this not an apples-to-apples comparison because the power dips with the smaller effect:

    . power twoproportions 0.146 0.148 , n1(254426) n2(255237) alpha(.05) test(chi2) effect(diff)

Estimated power for a two-sample proportions test
Pearson's chi-squared test
H0: p2 = p1  versus  Ha: p2 != p1

Study parameters:

    alpha =    0.0500
        N =   509,663
       N1 =   254,426
       N2 =   255,237
    N2/N1 =    1.0032
    delta =    0.0020  (difference)
       p1 =    0.1460
       p2 =    0.1480

Estimated power:

    power =    0.5224

I guess the lesson here is that power isn't an absolute property of a test and is relative to the size of the effect you want to detect. The effect I detect is smaller, but the power is worse.
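To make that concrete, here's a rough sketch of the approximate power at a few effect sizes, using the plain normal approximation with p1's variance for both groups (power twoproportions with test(chi2) is a bit more refined, so the numbers differ slightly):

    * Approximate power at a few effect sizes (normal approximation, one tail;
    * uses p1's variance for both groups, so only a rough check)
    local p1 = 0.146
    local n1 = 254426
    local n2 = 255237
    local za = invnormal(1 - 0.05/2)
    local se = sqrt(`p1'*(1 - `p1')*(1/`n1' + 1/`n2'))
    foreach d of numlist 0.0010 0.0020 0.0028 0.0040 {
        display "delta = " `d' "  approx power = " %5.3f normal(`d'/`se' - `za')
    }

This gives roughly 0.17, 0.52, 0.81, and 0.98, consistent with the 0.5224 above.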


dimitriy
  • If I read all this input and output correctly, you want to be able to detect a particular effect size 80% of the time. Why would it be strange, then, to be able to detect smaller effects? If you couldn't detect smaller effects, wouldn't your power be closer to 5% than 80%? – whuber Jul 30 '21 at 18:40
  • @whuber I left the key results out. Mea culpa, but now it is fixed. – dimitriy Jul 30 '21 at 18:56

1 Answer


This answer uses the unstandardised delta (0.0028 and 0.0020 in the question's examples) when referring to effect sizes.


Relationship between MDE and Test Power

The MDE is always tied to a test power: an MDE figure without a test power attached does not, strictly speaking, make sense (though many people implicitly assume 80% power when talking about MDEs).

The MDE under a specific test power tells you the minimum effect size required to achieve that test power. That is, if you have an MDE of 0.0028 under 80% test power, your treatment needs an effect size of at least 0.0028 for the statistical test to return a statistically significant result at least 80% of the time (assuming all other assumptions are met).
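For a rough sense of where a figure like 0.0028 comes from, the usual normal-approximation formula (which uses p1's variance for both groups; Stata's power twoproportions iterates on p2, so its answer differs slightly) is

    MDE ≈ (z_{1-alpha/2} + z_{power}) * sqrt( p1*(1-p1) * (1/n1 + 1/n2) )
        ≈ (1.96 + 0.84) * sqrt( 0.146*0.854 * (1/254426 + 1/255237) )
        ≈ 2.80 * 0.00099
        ≈ 0.0028

using the question's n1, n2 and p1 = 0.146.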

I am puzzled why I am able to detect a statistically significant effect that is smaller than the MDE from the power calculation for the same set of parameters.

Despite what the name MDE suggests, it is not true that you can no longer detect an effect (via a significant result) once its size drops below the 0.0028 threshold: you can still, by chance, draw samples with a large enough gap between the two groups to produce a statistically significant result. The chance is just lower.

In your example, the MDE (for the underlying effect size) under 80% test power is 0.0028, whereas the MDE under 50% test power would be ~0.0020. The latter can be calculated in Stata with:

. power twoproportions 0.146, n1(254426) n2(255237) alpha(.05) power(0.5) test(chi2) effect(diff)

This shows that if the effect size is 0.0020, we still stand a 50% chance of seeing a statistically significant result. Indeed, assuming all other parameters are fixed, the MDE increases with increasing required test power.
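As a rough check of that relationship, the same normal approximation can be looped over required power levels (again using p1's variance for both groups, so the figures sit slightly below Stata's iterated ones):

    * Approximate MDE at several required power levels (normal approximation)
    local p1 = 0.146
    local n1 = 254426
    local n2 = 255237
    local za = invnormal(1 - 0.05/2)
    local se = sqrt(`p1'*(1 - `p1')*(1/`n1' + 1/`n2'))
    foreach pow of numlist 0.5 0.8 0.9 {
        display "power = " `pow' "  approx MDE = " %6.4f (`za' + invnormal(`pow'))*`se'
    }

This prints approximately 0.0019, 0.0028, and 0.0032: the MDE grows with the required power.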

And thus, the answer to

Any idea why the MDE is so conservative here?

is that we constructed it to be conservative via our high test power requirement: by setting the power to 80%, we give ourselves quite a bit of slack.


Side note:

It is important to differentiate between the a priori test power, which is calculated using an assumed true, underlying treatment effect prior to the start of an experiment, and the post hoc test power, which is calculated using the treatment effect estimated from the data / samples.

I guess the lesson here is that power isn't an absolute property of a test and is relative to the size of the effect you want to detect. The effect I detect is smaller, but the power is worse.

While the power is indeed dependent on effect size with other parameters fixed, one often fixes the a priori test power (and hence determines the MDE), and calculates the post hoc test power given the sample effect size. The dynamics are thus different.

You can also bring the sample sizes into the "fix two and get the third one free" game, but that is beyond the scope of this question.

B.Liu