Maybe the following reasoning can help you understand why 0.99 seems a suspiciously high power.
An $h = 0.5$ is about the difference between success probabilities of 0.7 vs 0.46 (ES.h(0.7, 0.46) = 0.49). With an average sample size of 153 per group (the simulation below uses $n_1 = 122$ and $n_2 = 184$), this is the difference between roughly 107 and 70 successes out of 153, which is quite noticeable, especially since $\alpha = 0.05$ is not very stringent.
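For reference, this is the sort of pwr calculation I assume produced the 0.99 (the exact call isn't shown in the question, so the sample sizes here are taken from the simulation below):

library(pwr)
ES.h(0.7, 0.46) # 0.4917, roughly the "medium" h = 0.5
pwr.2p2n.test(h = 0.5, n1 = 122, n2 = 184, sig.level = 0.05) # power ~ 0.99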
This simulation verifies that the calculation is correct:
p1 <- 0.7    # assumed probability of success in group 1
p2 <- 0.46   # assumed probability of success in group 2
n1 <- 122
n2 <- 184
nreps <- 10000
set.seed(12345)
# Simulate nreps pairs of binomial counts under these proportions
y1 <- rbinom(n = nreps, size = n1, prob = p1)
y2 <- rbinom(n = nreps, size = n2, prob = p2)
# Test each simulated pair for equality of proportions
pval <- rep(NA, nreps)
for (i in 1:nreps) {
    pval[i] <- prop.test(c(y1[i], y2[i]), n = c(n1, n2), p = NULL)$p.value
}
# Power = fraction of simulations significant at alpha = 0.05
(power <- sum(pval < 0.05) / nreps) # <- 0.9851 as expected
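(The simulated power is marginally below the analytical 0.99; I suspect this is because prop.test() applies Yates' continuity correction by default, and passing correct = FALSE should bring the estimate closer to the pwr result.)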
But even if there is nothing wrong with your calculation, a power of 0.99 may still be too optimistic because it assumes your counts come from a binomial distribution. In real life, especially in biology, the binomial is too narrow and doesn't account for any variation beyond random sampling. Maybe this is why your intuition doesn't match your power analysis. Here I simulate counts where the probability of success is itself a random variable with a Beta distribution.
Even if on average the simulated counts are as expected (~70% success for group 1 and ~46% for group 2), the power is quite a bit lower:
nreps <- 10000
set.seed(12345)
# Draw each replicate's success probability from a Beta distribution
# (mean 0.7 for group 1, mean 0.46 for group 2), then draw the counts
y1 <- rbinom(n = nreps, size = n1, prob = rbeta(n = nreps, 6.65, 2.85))
y2 <- rbinom(n = nreps, size = n2, prob = rbeta(n = nreps, 5.25, 6.17))
pval <- rep(NA, nreps)
for (i in 1:nreps) {
    pval[i] <- prop.test(c(y1[i], y2[i]), n = c(n1, n2), p = NULL)$p.value
}
(power <- sum(pval < 0.05) / nreps) # 0.775
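A quick sanity check confirms that the overdispersed counts still have the intended marginal success rates:

mean(y1) / n1 # ~0.70
mean(y2) / n2 # ~0.46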
The parameters of the Beta distributions above are chosen to give mean 0.7 for group 1 and mean 0.46 for group 2, each with variance 0.02 (no particular reason for picking that variance). I used this function posted at Calculating the parameters of a Beta distribution using the mean and variance:
# Convert a desired mean and variance into Beta shape parameters
estBetaParams <- function(mu, var) {
    alpha <- ((1 - mu) / var - 1 / mu) * mu^2
    beta <- alpha * (1 / mu - 1)
    return(list(alpha = alpha, beta = beta))
}
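Applied to the means above with variance 0.02, it returns the shape parameters used in the simulation:

estBetaParams(0.70, 0.02) # alpha = 6.65, beta = 2.85
estBetaParams(0.46, 0.02) # alpha = 5.25, beta = 6.17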
It looks like you assumed $h$ was the difference between the two proportions, but it is the difference between the two proportions after arcsine transformation, which makes sense after all since otherwise we would expect to pass the two proportions as arguments to the function, like in power.prop.test(). Maybe you could expand upon your design and the observed or expected proportions, so that it is clear that a medium effect size is the best choice. In any case, a value of 0.99 suggests that you will likely detect a significant difference if there's one. – chl Oct 28 '20 at 17:59
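To make the arcsine point concrete, ES.h() computes $h = 2\arcsin\sqrt{p_1} - 2\arcsin\sqrt{p_2}$, which can be checked by hand:

2 * asin(sqrt(0.7)) - 2 * asin(sqrt(0.46)) # 0.4917, same as ES.h(0.7, 0.46)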