13

If we have an underpowered study but manage to reject the null hypothesis anyway, it makes sense to wonder whether we have overestimated the effect size.

However, such a concern seems unwarranted if we use an unbiased estimator of the effect size (such as $\bar X_1-\bar X_2$ to estimate $\mu_1-\mu_2$).

At the same time, there is an appeal to speculating that we must have lucked into a particularly large observed effect to have managed to reject despite the low power.

This seems like a paradox. Does it have a resolution?

Dave
  • 62,186
  • Power is based on a prior notion of the effect size. If that prior is good, then the estimate is probably too large, possibly just by chance. If the prior is not good, then the estimate might be ok. – BigBendRegion Nov 08 '22 at 15:50
  • 1
    @dipetkov Sure, but then why would there be an expectation that this estimate is too high? (I mean, I just posted a self-answer explaining why, but if an estimator is unbiased, we should not expect any particular value it gives to be a high miss rather than a low miss.) – Dave Nov 08 '22 at 17:28
  • A study is underpowered for the true effect size if we optimistically put in $\theta = 1$ in the power computation but the true $\theta$ is, say, 0.1. (Okay, in reality the "optimism" is probably a lot less.) Then what do you expect for the magnitude of $\hat{\theta}$ when it happens that the null hypothesis $\theta = 0$ is rejected? – dipetkov Nov 08 '22 at 17:40
  • How did you choose the reference value for computing the power? Of course if you choose it to be the true value, it means you know the true value, and then estimating it is pointless (and you do not expect but know whether you have overestimated the effect size, looking at the estimator). – Christian Hennig Nov 08 '22 at 18:05
  • @ChristianHennig You determine the effect size of interest that you want to be able to detect with a specified power, and then you find the sample size needed to do so. – Dave Nov 08 '22 at 18:13
  • 1
@Dave Doesn't this then depend on what the effect size of interest is and how far it is from the null hypothesis? Actually it seems that for any study you can find an effect size large enough that it is underpowered, so in general the term "underpowered" doesn't add any information!? Also in your answer, considerations regarding $\hat\theta\mid\text{reject }H_0$ don't seem to have anything to do with being underpowered. – Christian Hennig Nov 08 '22 at 18:16
  • I’m debating about an aspect of the simulation I posted and have deleted that answer until I resolve the issue. – Dave Nov 08 '22 at 18:37
  • 3
    I'm afraid I don't quite see the paradox. In your example of using $\overline{X}_1-\overline{X}_2$ to estimate $\mu_1-\mu_2$, yes, the estimator is unbiased. But that only means that its expectation is equal to $\mu_1-\mu_2$. Any specific realization could be far away from it, so we might indeed overestimate the effect size. And as multiple answers note, conditional on submitting a manuscript (because something is significant, because everything non-significant ends up in a desk drawer and is not submitted), the problem is more probable. What am I missing? – Stephan Kolassa Nov 09 '22 at 08:58
When talking about effect size, I would use $|\bar{X}_1-\bar{X}_2|$ rather than $\bar{X}_1-\bar{X}_2$. This is a nitpick, but a mix-up between the two has led me astray, so let others know better. – Richard Hardy Nov 09 '22 at 18:09

6 Answers

18

It seems to me that an underpowered study by definition is unlikely to give a small p-value against the null. Consequently, if you do get a small p-value it is likely that you are overestimating the true effect size.

However, if you look at all estimates from repeated experiments, regardless of whether they cross the significance threshold, you do get an unbiased overall estimate. This should resolve the paradox. Here's a simulation to illustrate:

We repeat an underpowered experiment 10,000 times:

set.seed(1234)
es <- 0.1 # The true difference
n <- 5    # and a small sample size
est <- rep(NA, 10000)      # estimated differences
p <- rep(NA, length(est))  # corresponding p-values
for(i in 1:length(est)) {
    a <- rnorm(n, mean=0)
    b <- rnorm(n, mean=0 + es)
    tt <- t.test(a, b)
    est[i] <- diff(tt$estimate)  # mean(b) - mean(a), i.e. the estimated difference
    p[i] <- tt$p.value
}

If you consider only experiments with p < 0.05, you get extreme estimates (the blue line is the true value). Note that some estimates are not only extreme but also in the wrong direction (those to the left of the blue line):

hist(est[p < 0.05], xlab='Estimates where p < 0.05', main='')
abline(v=es, col='blue', lty='dashed')

[Histogram of the estimates with p < 0.05; the dashed blue line marks the true effect size]

Nevertheless, the estimator is unbiased across the 10,000 experiments:

mean(est)
[1] 0.1002

Count of over- and under-estimating experiments:

length(est[est > es])
[1] 5056
length(est[est <= es])
[1] 4944

dariober
  • 4,250
9

To resolve the issue of bias, note that, when we consider the effect size in a test that rejects, we no longer consider the entire distribution of $\hat\theta$ that estimates $\theta$ but $\hat\theta\vert\text{reject }H_0$, and there is no reason to expect this latter distribution to have the unbiasedness that $\hat\theta$ has.
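As a concrete illustration of this point (a sketch added here, not part of the original argument; the one-sample z-test with known variance, $\theta = 0.1$, and $n = 25$ are assumed purely for illustration):

# Hypothetical setup for illustration: one-sample z-test with known unit
# variance, so that theta_hat ~ Normal(theta, 1/sqrt(n)).
set.seed(1)
theta <- 0.1
n <- 25
se <- 1 / sqrt(n)
theta_hat <- rnorm(1e6, theta, se)            # unconditional draws of the estimator
reject <- abs(theta_hat / se) > qnorm(0.975)  # two-sided test at alpha = 0.05

mean(theta_hat)          # close to 0.1: the unconditional estimator is unbiased
mean(theta_hat[reject])  # roughly 0.4: conditional on rejection, heavily inflated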

Regarding the issue of being "underpowered", it is true that a formal definition of this term would be nice. Note, however, that as power increases, the estimation bias in estimates corresponding to rejected null hypotheses decreases.

library(pwr)
library(ggplot2)
set.seed(2022)

Ns <- seq(50, 2000, 50)
B <- 10000
powers <- biases <- ratio_biases <- rep(NA, length(Ns))
effect_size <- 0.1

for (i in 1:length(Ns)){

  # Power of the one-sample t-test at this sample size
  powers[i] <- pwr::pwr.t.test(
    n = Ns[i], d = effect_size, type = "one.sample"
  )$power

  # Record the observed effect only when the null hypothesis is rejected
  observed_sizes_conditional <- rep(NA, B)
  for (j in 1:B){

    x <- rnorm(Ns[i], effect_size, 1)

    pval <- t.test(x)$p.value

    if (pval <= 0.05){
      observed_sizes_conditional[j] <- mean(x)
    }
  }

  observed_sizes_conditional <- observed_sizes_conditional[
    !is.na(observed_sizes_conditional)
  ]

  ratio_biases[i] <- mean(observed_sizes_conditional)/effect_size
  biases[i] <- mean(observed_sizes_conditional) - effect_size

  print(paste(i, "of", length(Ns)))
}

d1 <- data.frame(
  Power = powers, Bias = biases, Statistic = "Standard Bias"
)
d2 <- data.frame(
  Power = powers, Bias = ratio_biases, Statistic = "Ratio Bias"
)
d <- rbind(d1, d2)

ggplot(d, aes(x = Power, y = Bias, col = Statistic)) +
  geom_line() +
  geom_point() +
  facet_grid(rows = vars(Statistic), scales = "free_y") +
  theme_bw() +
  theme(legend.position = "none")

I do not know the correct term for what I mean by "ratio bias", but I mean $\dfrac{\mathbb E[\hat\theta\mid\text{reject }H_0]}{\theta}$, which is what the simulation computes. Since the effect size is not zero, this fraction is defined.

[Plot of the conditional estimation bias against power, faceted into "Standard Bias" and "Ratio Bias" panels; the bias shrinks toward zero (and the ratio toward one) as power increases]

This makes sense for the t-test, where the standard error will be larger for a smaller sample size (less power), requiring a larger observed effect to reach significance.
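To put rough numbers on this (my own back-of-the-envelope sketch; the sample sizes and the assumption that the sample standard deviation comes out near 1 are illustrative):

# Smallest observed mean that a one-sample t-test can declare significant at
# alpha = 0.05, assuming the sample standard deviation is about 1.
n <- c(50, 200, 800, 2000)
round(qt(0.975, df = n - 1) / sqrt(n), 3)
# approx. 0.284 0.139 0.069 0.044

With a true effect of 0.1, only the larger (better-powered) samples can reject without the observed mean sitting well above the truth.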

By showing this, we avoid that irritating issue of defining what an "underpowered" study means and just show that more power means less estimation bias. This explains what is happening in the linked question, where a reviewer asked an author for the power of the test in order to screen for gross bias in the conditional estimator $\hat\theta\vert\text{reject }H_0$. If the power is low, the graphs above suggest that the bias will be high, but high power makes the bias nearly vanish, hence the reviewer wanting high power.

Dave
  • 62,186
  • 2
    Upvotes to the others who posted answers: I would not have figured this out or thought to run this simulation without reading your responses. – Dave Nov 08 '22 at 22:34
  • 3
    I think this paper is relevant to the topic: A. Gelman and J. Carlin. Beyond power calculations: Assessing type S (sign) and type M (magnitude) errors. They use "exaggeration ratio" for the quantity you term "ratio bias". – dipetkov Nov 08 '22 at 22:55
  • @dipetkov I like that! Have you seen it used elsewhere? – Dave Nov 08 '22 at 22:56
Gelman uses S error and M error more often than exaggeration ratio. Outside of his work, I'm not sure how many use the terms. I think these tricky topics fall under the umbrella of "it's not a good idea to focus primarily on whether a result is statistically significant or not". You know how widely this idea is accepted... – dipetkov Nov 08 '22 at 23:00
  • @RichardHardy I have a symmetric estimator yet achieve bias when I condition on rejecting, don’t I? // Even with low power, doing $10000$ simulations per sample size should result in at least a thousand rejections and accurate estimates of the bias. – Dave Nov 09 '22 at 17:14
  • @RichardHardy That’s a fair point. Adding a t-test of the bias for each sample size is showing biases significantly ($\alpha=0.05$) different from zero until the two largest sample sizes. (A nice GGPlot2 exercise for me might be to add confidence intervals on the graph.) // At no point in my code do I take an absolute value. – Dave Nov 09 '22 at 17:30
  • At no point do I take the absolute value, yet the simulation achieves significant biases. – Dave Nov 09 '22 at 17:51
  • 1
I wonder whether the distribution of $\hat\theta\mid\text{reject }H_0$ in itself should raise the suspicion that $\hat\theta$ overestimates. I mean, under suitable assumptions you can show that $E(\hat\theta|\hat\theta>1)>E(\hat\theta)$, but this doesn't mean we should suspect $\hat\theta$ to be an overestimate whenever it comes out larger than 1 (or larger than any constant, as in many situations this would hold for any constant larger than $-\infty$) - or does it? – Christian Hennig Nov 10 '22 at 10:27
  • 2
Probably $\hat\theta\mid\text{reject }H_0$ is relevant in the case where $\hat\theta$ is only reported or visible if $H_0$ is rejected. If reporting $\hat\theta$ is conditioned on rejection, reported effect sizes overestimate on average - this is the well-known "publication bias". But if it's our own study and we see $\hat\theta$ in any case, rejecting $H_0$ will not bias it. – Christian Hennig Nov 10 '22 at 10:29
6

You may have an estimator $\hat\theta$ that is (unconditionally) unbiased for its target: $\mathbb{E}(\hat\theta)=\theta$. The absolute value of the estimator $|\hat\theta|$ may also be (unconditionally) unbiased for the absolute value of the target: $\mathbb{E}(|\hat\theta|)=|\theta|$. (The absolute value rather than the raw value is relevant when considering effect size.)

However, once you condition on statistical significance of the estimate, the absolute value of the conditional estimator will generally no longer be unbiased for the absolute value of the target: $\mathbb{E}(|\hat\theta|\mid \hat\theta\text{ is stat. signif. at }\alpha\text{ level})\neq|\theta|$.
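A quick simulation sketch (added for illustration, not from the original answer; the one-sample t-test with true effect $\theta = 0.1$ and $n = 30$ is an assumed setup) shows the size of this conditional bias:

set.seed(42)
theta <- 0.1
n <- 30
res <- replicate(20000, {
  x <- rnorm(n, theta, 1)
  c(est = mean(x), p = t.test(x)$p.value)
})
mean(abs(res["est", ]))                   # roughly 0.17 (already above 0.1, since the
                                          # estimate sometimes flips sign)
mean(abs(res["est", res["p", ] < 0.05]))  # roughly 0.45 -- several times |theta| = 0.1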

(I had struggled with a similar question over here: Understanding Gelman & Carlin "Beyond Power Calculations: ..." (2014). The issue was not really the essence but rather presentation. In the beginning it was not immediately clear to me that Gelman & Carlin were actually conditioning on statistical significance.)

Richard Hardy
  • 67,272
5

Possibly the following image might shed some light:

[Image: different fiducial densities with the same p-value]

Given that the null hypothesis is true, there is always a probability $\alpha$ of rejecting it, no matter what the power of the test is*.

But the power of the test makes the overall picture look very different. Possibly the paradox stems from focusing too exclusively on the null hypothesis and the p-value.
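As a quick sanity check of the first point (a small illustrative sketch of my own; the one-sample t-test on null data and the particular sample sizes are arbitrary choices):

set.seed(3)
reject_rate <- function(n, nsim = 10000) {
  # proportion of rejections at alpha = 0.05 when the null is exactly true
  mean(replicate(nsim, t.test(rnorm(n))$p.value < 0.05))
}
reject_rate(5)    # roughly 0.05, despite very low power against any alternative
reject_rate(500)  # roughly 0.05 as well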


*Or, actually, the rejection probability might be a bit higher, because the hypothesis test is based on a theoretical model for the errors and reality might differ (sampling issues like outliers or correlation between measurements).

Richard Hardy
  • 67,272
5

It's not a paradox. You may call it a dilemma, or more precisely an unknown. You have correctly narrowed it down to the two possible outcomes: you are either really "lucky", or the assumptions behind the power calculation are incorrect. There is no way to know which is which based on the results of one study alone. These considerations matter even for well powered studies with statistically significant findings.

AdamO
  • 62,637
  • 2
    What does "really lucky" mean in this context? – dipetkov Nov 08 '22 at 22:13
  • @dipetkov: in the example in the original post of using $\overline{X}_1-\overline{X}_2$ to estimate $\mu_1-\mu_2$, yes, the estimator is unbiased. But that only means that its expectation is equal to $\mu_1-\mu_2$. Any specific realization could be far away from it, which would be "really lucky", in terms of estimating a large effect size (yay, we can publish!), or "really unlucky" in terms of injecting noise into the scientific conversation. – Stephan Kolassa Nov 09 '22 at 08:59
  • @StephanKolassa That's only if you equate "lucky" with "publish my paper whether the science behind is good or not". So you are lucky but science not so much. – dipetkov Nov 09 '22 at 09:10
5

You have hit on the same question discussed in the well-known Why Most Published Research Findings Are False paper. If you do a lot of experiments as a scientific community and quite a few of the tested null hypotheses are true (i.e. people try to show a whole lot of effects that aren't really there, while some are), then "underpowered" studies are more likely to produce false positive findings than "well-powered" studies. Similarly, once one conditions on statistical significance, point estimates are biased away from where you put your null hypothesis. This bias is larger, the more underpowered a study is.

You might critique this by saying that null hypotheses are rarely exactly true, but the exact same things happen when you instead look at a set-up where many effects are very small and only a few are big.

People worry about this a lot in drug development, where large companies will run early stage proof of concept studies (you can look at those as a kind of screening tool for deciding which projects to pursue further) for many potentially promising new drugs (of which most will not have a meaningful effect on the disease of interest). It is important for these studies to not be completely underpowered, because otherwise "positive" proof of concept results will become useless as a tool for prioritizing which drugs to study further.
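To make the screening intuition concrete (a back-of-the-envelope sketch with made-up numbers, not taken from the answer): suppose only 10% of candidate drugs truly work and each test is run at $\alpha = 0.05$; the positive predictive value of a "significant" result then depends heavily on power.

ppv <- function(power, prior = 0.10, alpha = 0.05) {
  # P(effect is real | test is significant), ignoring effect-size estimation
  (power * prior) / (power * prior + alpha * (1 - prior))
}
ppv(power = 0.10)  # ~0.18: with an underpowered screen, most "positives" are false
ppv(power = 0.80)  # ~0.64: a well-powered screen is a far more useful filter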

Björn
  • 32,022
  • 1
    @SextusEmpiricus it's actually also that. Imagine the extreme case of running lots of studies with 0 data and rejecting the null hypothesis at random 5% of the time (severely underpowered studies are barely better than that, in any case). In that case the probability of getting false research findings amongst the total of findings is higher (basically the proportion of true null hypotheses) than if you ran well-powered studies (in which case it becomes increasingly more likely that you would actually identify false null hypotheses). – Björn Nov 10 '22 at 09:55
  • 2
    "You might critique this by saying that null hypotheses are rarely exactly true, but the exact same things happen when you instead look at a set-up where many effects are very small and only a few are big." The difficulty here is that it isn't entirely clear in case of a very small effect that one would actually want to not reject. In case the H0 is true it's clear that rejection is a type I error. In case the effect is very small, it depends on how small it exactly is and what counts as a meaningful effect in the specific situation. Technically it's not wrong to reject a false H0. – Christian Hennig Nov 10 '22 at 10:23
  • 1
    @ChristianHennig True, good points, if we talk about point null hypotheses vs. two-sided alternatives. That's where it's then an issue that one would in those situations not just reject the null hypothesis, but also vastly overestimate the effect (that's the bit that certainly doesn't change). Rejecting the null hypothesis of no effect when there's a true small effect would be just fine, if one then correctly "identified" that there's only a small effect. The hypothesis testing part still goes wrong though, if you test a null hypothesis like "absolute effect size is irrelevant = $< \epsilon$". – Björn Nov 10 '22 at 15:13