
With all the concern about reproducibility, I have not seen a very basic question answered. Using the standard hypothesis testing approach, if one experiment results in p<0.05, what is the chance that a repeat experiment will also result in p<0.05? I've seen a related problem approached by Goodman (1) and others, starting with a particular p-value for the first experiment, but I have not seen the problem treated in the more general form I state here.

So my question here is whether the approach below has already been published somewhere.

Let's make the pretty standard choices alpha = 0.05 and power = 0.80. We also need to define the scientific context of the experimentation. Let's say we are in a situation where we expect half of the hypotheses tested to be true and half to be false. In other words, the probability that the null hypothesis is true is 0.50, which we'll call pNull.

Let's compute the results of 1000 (arbitrary, of course) first experiments.

  • Number of experiments where the null H is actually true = 1000 * pNull = 500.
  • Number of these expected to result in p<alpha = 500 * alpha = 25 experiments.
  • Number of experiments where the alternative H is actually true = (1 - pNull)*1000 = 500
  • Number of these expected to result in p<alpha = 500 * power = 400
  • Total experiments expected to result in p<alpha = 25 + 400 = 425

Now on to the second experiment. We only run the second experiment for cases where the first experiment resulted in p<alpha.

  • Of the 25 experiments (where the null is actually true), how many of the second experiments are expected to result in p<alpha? 25 * alpha = 1.25
  • Of the 400 experiments (where the alternative is true), how many of the repeat experiments are expected to result in p<alpha? 400 * power = 320
  • Number of second experiments expected to result in p<alpha = 1.25 + 320 = 321.25

Given that the first experiment resulted in p<alpha, the chance that a second identical experiment will also result in p<alpha is 321.25/425 ≈ 0.756.

This assumes you set alpha = 0.05 and power = 0.80, and that the scientific situation is such that pNull = 0.50. I like to think things out verbally, but of course this can all be compressed into equations, as below. My question is whether this straightforward approach has already been published.
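Compressed into one equation (nothing new here, just the bullet-point arithmetic above rewritten):

$$\Pr(p_2<\alpha \mid p_1<\alpha) \;=\; \frac{\text{pNull}\cdot\alpha^{2} + (1-\text{pNull})\cdot\text{power}^{2}}{\text{pNull}\cdot\alpha + (1-\text{pNull})\cdot\text{power}} \;=\; \frac{0.5\cdot 0.05^{2} + 0.5\cdot 0.80^{2}}{0.5\cdot 0.05 + 0.5\cdot 0.80} \approx 0.756.$$

And a minimal Python sketch of the same calculation (the variable names are arbitrary, and the values are just the assumptions stated above):

```python
alpha, power, p_null = 0.05, 0.80, 0.50

# Fraction of first experiments expected to reject H0 (false positives plus true positives)
p_reject_first = p_null * alpha + (1 - p_null) * power        # 0.425

# Fraction of cases in which both the first and an identical repeat experiment reject H0
p_reject_both = p_null * alpha**2 + (1 - p_null) * power**2   # 0.32125

# Chance that the repeat experiment rejects, given that the first one did
print(p_reject_both / p_reject_first)                         # ~0.756
```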

  1. Goodman, S. N., 1992, A comment on replication, P-values and evidence: Statistics in Medicine, v. 11, no. 7, p. 875–879, doi:10.1002/sim.4780110705.
  • Prior probability of what equals $0.50?$ – Dave Apr 10 '23 at 17:12
  • By "prior probability" I assume that you expect the null is true in half of the experiments? In that case shouldn't "Total second experiments with p<0.05 = 1.25/2 + 320/2 = 321.25/2 = 160.625"? In any case, I'm not sure why you want to make this half-and-half assumption. – dipetkov Apr 10 '23 at 17:38
  • @dave. Yes, I assume that the prior probability of the alternative hypothesis is 50% (so the prior of the null is also 50%). Of course, the same logic works with any value. If you are working in a well established field, the prior might be 90%. If you are doing a wild experiment with no prior data or theory, the prior might be 1%. – Harvey Motulsky Apr 10 '23 at 18:07
  • @dipetkov. The prior of 50% means in half of the first experiment, the null is true. But we run the repeat experiment only if p<0.05 in the first experiment, and that selects against cases where the null is true. – Harvey Motulsky Apr 10 '23 at 18:09
  • Okay. I still find this confusing: Why mention p-values at all? Your question seems to be about replicating the reject/don't reject $H_0$ conclusion. (I think the Goodman 1992 paper is about replicating the p-value: observing either $p_2 = p_1$ or an even smaller $p_2$ during the second experiment.) – dipetkov Apr 10 '23 at 18:23
  • Yes, the question is about replicating a "reject the null hypothesis because the p value is less than alpha". But Goodman started with a particular p-value (not a range and not an inequality). I'm starting with all first experiments that reject and ask how likely the second experiment is to reject H0. (Also, I changed "p<0.05" to "p<alpha" in a bunch of places to be more general.) – Harvey Motulsky Apr 10 '23 at 18:26
  • The complexity is that the power is computed for a fixed $\alpha$ level (and a fixed true effect of course). But the p-value (conditional on the reject/don't reject decision) is not fixed. Anyway, I think your formulation is simpler and it will be even clearer without mentioning p-values at all. – dipetkov Apr 10 '23 at 18:31
  • @dipetkov. I don't follow. I did edit the question to replace "0.05" with "alpha" in a few cases to keep it general. But I don't think that is enough to satisfy your objection, and I don't really understand what you are looking for.... – Harvey Motulsky Apr 10 '23 at 18:56
  • "Prior probability = 0.50" is an extraordinary assumption even in a Bayesian context. I am not in the least surprised that such a situation might not be generally discussed: it wouldn't be relevant to most applications of NHST. Moreoverk, in a Bayesian context nobody would be reporting a p-value: they would report the posterior distribution. – whuber Apr 10 '23 at 19:20
  • @whuber I'm happy to recompute for a different value of prior. What do you suggest? – Harvey Motulsky Apr 10 '23 at 19:30
  • I suggest not using a prior at all when you are doing classical NHSTs, because they don't use priors. That basically nullifies your question, because it becomes unanswerable. Instead, the classical approach would be to ask about the chance of obtaining two p-values less than $0.05$ in two independent experiments. That depends on the state of nature and can readily be computed from power of the test. E.g., under $H_0,$ the chance is at most $0.05\times 0.05.$ – whuber Apr 10 '23 at 19:58
  • Seems to me that you might be missing the most educative feature of the setup: learning from the first experiment. If the first run gave a p-value close to 0.05 then the evidence for a false null is only modest, and a well designed second experiment would have a large sample size, which would yield a higher power at the originally assumed alternative hypothesised effect size. – Michael Lew Apr 10 '23 at 21:51
  • @MichaelLew. Clever idea to rethink sample size computation for a repeat experiment. But my goal is simpler. When one experiment rejects H0 and the other doesn't, people ask why the results didn't reproduce. My question is basically: how often will this happen just by chance? – Harvey Motulsky Apr 11 '23 at 02:52
  • So do you now accept that your questions in your response above mean that no priors of any type are involved? – Graham Bornholt Apr 11 '23 at 04:34
  • As an aside, the setup you describe is sometimes called the "two trials rule" (when there is one treatment to test rather than 1000). It's not an efficient design. This is discussed in a few sections of Statistical Issues in Drug Development which you might find relevant. – dipetkov Apr 11 '23 at 04:49
  • Your basic question has an easy answer. If the true situation produces a p-value less than $0.05$ with a chance of $q_1$ on the first trial and $q_2$ on the second trial, then independence of the two trials tells you the chance of first getting a significant result and then failing to repeat it immediately is $q_1(1-q_2).$ The $q_i$ depend on the effect size and the powers of the two experiments to detect it. When $H_0$ holds, $q_1=q_2=0.05$ by design. Note that in one-sided tests either or both $q_i$ often are larger than $0.05.$ – whuber Apr 11 '23 at 13:18
  • Here is another similar question: https://stats.stackexchange.com/questions/268524/given-p-0-05-in-one-study-what-is-the-likelihood-of-p-ge-0-05-in-a-repeate – Karolis Koncevičius Apr 11 '23 at 18:56
  • @KarolisKoncevičius The linked thread(s) are informative, thank you for pointing them out. But they all ask questions! Surely, the condition "given p = 0.05 in one study" is actually very different from "given that $H_0$ is rejected in one study" and still different from "given that we obtained p-value $p$ in one study". It seems to me, all frameworks for hypothesis testing are getting mixed and matched (and they weren't designed to do so): Neyman-Pearson approach (reject/don't reject), Fisher's (p-value), Bayesian. – dipetkov Apr 11 '23 at 20:07
  • @whuber and H. Motulsky: I do like the "prior distribution" setup here. It is legitimate to ask what the frequentist properties are of Bayesian procedures, so why not ask what the properties of a frequentist procedure are in a Bayesian setup? If the true situation is as specified (with frequentist interpretation of the unobserved prior), one cannot take for granted that the analyst is Bayesian, or that they know the true prior! The problem is of course that results will crucially depend on this prior, but one could explore this systematically. – Christian Hennig Apr 17 '23 at 21:34
  • That said, it will be something of a "hard sell" as most people think if you start to write down a prior, analysis has to be Bayesian. But this is not so! – Christian Hennig Apr 17 '23 at 21:36
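To make the two-independent-trials calculation in whuber's comments concrete, here is a minimal sketch (the plugged-in probabilities are only examples; in practice they come from each experiment's power under the true state of nature):

```python
def significant_then_not(q1: float, q2: float) -> float:
    """Chance of a significant first result followed by a non-significant repeat,
    for two independent experiments with per-experiment rejection probabilities q1, q2."""
    return q1 * (1 - q2)

print(significant_then_not(0.05, 0.05))  # H0 true: both probabilities equal alpha -> 0.0475
print(significant_then_not(0.80, 0.80))  # H0 false, both experiments powered at 0.80 -> 0.16
```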

2 Answers


Regarding the truth of a single experimental hypothesis, there are two cases:

  • In the first case, we assume the null to be true, so the false positive error rate $\alpha$ applies by design. Given that experiment 1 was a false positive, the probability that experiment 2 is also significant (another false positive) is still 0.05 - expecting any other value is the gambler's fallacy.

  • In the second case, we assume the null to be false - but the precise value of the parameter or estimand is unknown. If the original study is well powered (say 0.8) and the follow-up is identical, then, if the assumptions are correct, the probability of the replicate showing significance is 0.8. If instead you update the confirmatory design based on the findings from the first study, it may be less likely to reproduce p<0.05, because having conditioned on a significant first result, the initial estimate is known to be favorable and tends to overstate the true effect (regression to the mean).

Your arithmetic presentation doesn't make sense in the context of a single hypothesis because we cannot speak of a heterogeneous "truth" of the null - not without unnecessarily invoking a Bayesian approach.

Regarding a collection of hypotheses, such as a scientific body of evidence or a clinical trial repository, you can speak of the relative frequencies of false positives and true positives - assuming negative primary results are not published, a frequent problem of publication bias. The exact distribution depends on the quality of the science initially performed.

So if there are 10,000 initial attempts at experiments and only 1,000 of these test a genuinely false null (with a well controlled study at 80% power), then there are 1000 * 0.80 = 800 true positives and 9000 * 0.05 = 450 false positives that gain publication. That is, the probability that any given publication is actually correct is 800/1250 = 64%. If we replicate all publications, the 800 true positives will "replicate p<0.05" with 80% probability, so 640 true positive findings are confirmed. The 450 false positives, however, will replicate with only 5% probability, so only about 22 false positives are confirmed. Overall, there will be a (640 + 22.5)/1250 ≈ 53% confirmation rate. The general formula is:

$$ N(\text{Confirmed}) = N(\text{Studies with } \mathcal{H}_0 \text{ true}) \alpha^2 + N(\text{Studies with } \mathcal{H}_0 \text{ false}) \beta^2$$

where $\alpha$ is the false positive error rate and $\beta$ here denotes the power (not, as the symbol is often used, the type II error rate); note you can easily generalize this to studies with differing $\alpha$s and $\beta$s.
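As a quick numerical check of this formula against the worked example above (a minimal sketch; the variable names are arbitrary):

```python
# Worked example: 10,000 attempted studies, 1,000 of which test a genuinely false null
# with 80% power; alpha = 0.05 throughout.
alpha, power = 0.05, 0.80
n_null_true, n_null_false = 9_000, 1_000

true_pos = n_null_false * power    # 800 initial true positives
false_pos = n_null_true * alpha    # 450 initial false positives
print(true_pos / (true_pos + false_pos))    # ~0.64: chance a given publication is correct

# N(Confirmed) = N(H0 true) * alpha^2 + N(H0 false) * power^2
confirmed = n_null_true * alpha**2 + n_null_false * power**2
print(confirmed / (true_pos + false_pos))   # ~0.53: overall confirmation rate
```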

AdamO

One must distinguish between two cases.

  1. Given P1 < alpha, what is the probability that P2 < alpha?
  2. Given P1 = alpha, what is the probability that P2 < alpha?

Goodman treats the second case. For the second case the answer is plausibly 1/2, and it does not really depend on using P-values. Take any two statistics, say S1 and S2, and assume that nothing else is known. Then $P(S_2 < S_1) = P(S_1 < S_2)$. Of course, in practice any Bayesian having observed a particular value for S1 may think differently, but that would depend on something else being known or believed. There is a published commentary of mine in Statistics in Medicine, and the question is also treated in Statistical Issues in Drug Development.
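A small simulation sketch of the symmetry argument (the normal distribution below is an arbitrary choice; any continuous distribution gives the same result):

```python
# For two independent draws S1, S2 of the same continuous statistic, P(S2 < S1) = 1/2.
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1_000_000
s1 = rng.normal(size=n)   # arbitrary continuous distribution
s2 = rng.normal(size=n)
print((s2 < s1).mean())   # ~0.5
```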

References
SENN, S. J. 2002. A comment on replication, p-values and evidence by S.N.Goodman, Statistics in Medicine 1992; 11:875-879. Statistics in Medicine, 21, 2437-44.
SENN, S. J. 2021. Statistical Issues in Drug Development, Chichester, John Wiley & Sons.

Michael Lew
  • +1. But of course you assume the distributions of the $p_i$ are continuous. – whuber Apr 12 '23 at 16:35
  • @whuber I think Dr Senn's point is that one interpretation of "replication" is setting the $\alpha$ for study 2 to the $p$ from study 1. This is definitely a way to engineer regression to the mean into the equation. I also point this out in my answer: if you condition on a study being a "success" (whatever that means...) you're going to sabotage replication by updating any aspect of the replicate design based on initial findings. This includes changing alpha. But I don't think many folks would agree with setting $\alpha$ to any other value than the initial $\alpha$. – AdamO Apr 12 '23 at 17:28
  • @AdamO Yes, but concluding "the answer is plausibly $1/2$" requires continuity of the distribution of $p$ at the value $\alpha.$ This is not the case for many tests, including in Binomial settings and for many rank-based nonparametric procedures. – whuber Apr 12 '23 at 18:04
  • @whuber well said. To be fair, the conventional process of significance testing is generally ignorant on this point! So yes, my comment isn't adding much, and this assumption (continuous p) needs to be laid out to begin meaningfully treating this problem. – AdamO Apr 12 '23 at 18:56