How can the combined p-value be larger than one of the original two p-values?

Question

I have two p-values from independent tests that I wish to combine. I do this using Fisher's method, but I get some results that I don't quite understand. Combining two p-values can give a larger p-value than one of the original ones:

from scipy.stats import combine_pvalues
p1 = 0.25
p2 = 0.04
statistic,p_val = combine_pvalues([p1,p2])
print(statistic)
print(p_val)
>>> 9.210340371976182
>>> 0.05605170185988095

From Wikipedia: a p-value "is the probability of obtaining test results at least as extreme as the result actually observed, under the assumption that the null hypothesis is correct". I am probably misunderstanding something, but if the probability of obtaining the result of the second test is only 0.04, then I don't see how the probability of obtaining BOTH results can be larger, i.e. 0.056.

How can this be?

The simple formula I give at https://stats.stackexchange.com/a/314739/919 offers some insight: the combined p-value is the product of the p-values times a correction factor that can be large. Take care in interpreting "as extreme:" it's not the same for the combined test as it is for the individual tests! — whuber, Sep 05 '23 at 14:47
Not only can Fisher's method produce a value bigger than one of the p-values, it can be bigger than both - e.g. two p-values of $0.3$ results in a Fisher's method p-value of about $0.307$. — Glen_b, Sep 05 '23 at 17:40
@Glen Interesting point! Indeed, the combined p-value can be as much as 1.21306... times greater than the larger of the two p-values. (This extreme is reached when both p-values are close to 0.60653.) It can also be an arbitrarily small fraction of the larger of the two p-values (attained as they both approach 0). — whuber, Sep 05 '23 at 20:20
I recall playing with $k$ p-values many years ago (some time in the 80s), it's interesting to see what happens when they're all in the vicinity of $0.5$. [If I remember right, I was trying to come up with an intuitive explanation for what the combined test was "seeing" to keep making the overall $p$ larger as $k$ grew when the individual $p$'s were all of middling size. I don't presently remember what my 'intuitive' explanation was in the end, but I could probably go look it up, or start thinking about it again and probably reconstruct it.] — Glen_b, Sep 05 '23 at 22:37
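
The numbers quoted in these comments can be checked with a few lines of Python. This is only a sketch: it assumes the closed form for exactly two p-values, combined = p1·p2·(1 − ln(p1·p2)), which follows from the upper tail of a chi-square distribution with 4 degrees of freedom. The function names `fisher_two` and `ratio` are made up for illustration.

```python
import math

def fisher_two(p1, p2):
    """Fisher's combined p-value for two independent p-values.

    Assumes the closed form p1*p2*(1 - log(p1*p2)), i.e. the product
    of the p-values times a correction factor.
    """
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def ratio(p):
    """Combined p-value divided by p when both p-values equal p."""
    return fisher_two(p, p) / p

print(fisher_two(0.25, 0.04))  # the example from the question
print(fisher_two(0.3, 0.3))    # larger than both inputs

# The ratio p*(1 - 2*log(p)) is maximised at p = exp(-1/2).
p_star = math.exp(-0.5)
print(p_star, ratio(p_star))
```

The correction factor 1 − ln(p1·p2) is always at least 1, and it is what can push the combined value above one or both of the inputs.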

AdamO · Answer 1 · 2023-09-05T13:14:01.137

Fisher's method for independent p-values works like so: back-calculate the chi-square test statistic that would have produced such a value, sum these chi-square values together, and recalculate the p-value according to the new chi-square distribution with updated degrees of freedom: i.e. $2k$ or two times the number of tests.

Let's see if we can replicate Python's result.

> x <- -2 * sum(log(c(0.25, 0.04)))
> x
[1] 9.21034
> pchisq(x, df = 4, lower.tail = FALSE)
[1] 0.0560517
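
For comparison, here is a minimal sketch of the same three steps in plain Python for any number of p-values. It is not how `scipy.stats.combine_pvalues` is implemented; it assumes the standard closed form for the upper tail of a chi-square distribution with an even number of degrees of freedom, and `fisher_combine` is a hypothetical helper name.

```python
import math

def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method (stdlib only)."""
    k = len(pvalues)
    # Each -2*log(p_i) is chi-square with 2 df under its null hypothesis;
    # their sum is chi-square with 2k df when the tests are independent.
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    # Upper tail for even df = 2k:
    #   P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    half = statistic / 2.0
    tail = math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))
    return statistic, tail

stat, p = fisher_combine([0.25, 0.04])
print(stat, p)  # should agree with the scipy and R results above
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the tail formula.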

Based on a quick read of this, one should not view Fisher's combined $p$ as a method to pool results, even if the inference is described as a kind of meta-analysis (note: pooling and meta-analysis are overlapping but distinct methods). Fisher's method reduces many $p$-values to a single $p$-value, which is interpreted as a simultaneous test of all the null hypotheses, a kind of way to reduce the FDR (false discovery rate). A combined $p<0.05$ is taken as evidence that at least one null hypothesis is false. If all null hypotheses are true, Fisher's combined $p$ still preserves the nominal alpha level of the test.

If the first sample is very consistent with the null hypothesis, it will drag the combined $p$-value upward, attenuating the evidence from the second sample. This seems intuitive given the description of the test.

You could explore the operating characteristics of using $$\sup_{i \le k} p_i$$ as a combined $p$-value. But I doubt this would have a well-conserved false positive error rate.

I don't think I can understand how a $p$ would get more precise except if what you are looking for is a kind of pooled test.

I understand what operations are being performed, but I don't understand the results: could you explain how the probability of two observations occurring can be larger than the probability of either of those observations occurring? This would only make sense to me if the two observations were somehow dependent, but Fisher's method explicitly assumes that they are independent. — Chris_abc, Sep 05 '23 at 13:13

score 0 · Answer 2 · answered Sep 05 '23 at 20:13

Suppose that the null hypotheses were both true. In that case, the combined $p$-value should have a uniform distribution on [0,1], as should each of the two separate $p$-values.

If the combined p-value were always lower than both the separate $p$-values, it could not be uniform on [0,1]. In fact, even if the combined $p$-value were always smaller than the larger of the two separate $p$-values it couldn't be uniform on [0,1].
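
A small simulation makes this concrete. The sketch below assumes both nulls are true, so each p-value is drawn uniformly, and it uses the two-test closed form p1·p2·(1 − ln(p1·p2)) for Fisher's combined p-value. The names `fisher_two` and `simulate_null` are illustrative only.

```python
import math
import random

def fisher_two(p1, p2):
    """Fisher's combined p-value for two independent p-values (closed form)."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def simulate_null(n=100_000, seed=0):
    """Draw pairs of uniform p-values and summarise the combined p-value."""
    rng = random.Random(seed)
    below_05 = above_min = above_max = 0
    for _ in range(n):
        # 1 - random() lies in (0, 1], which avoids log(0).
        p1, p2 = 1.0 - rng.random(), 1.0 - rng.random()
        pc = fisher_two(p1, p2)
        below_05 += pc <= 0.05        # rejection rate at the 5% level
        above_min += pc > min(p1, p2)  # combined exceeds the smaller input
        above_max += pc > max(p1, p2)  # combined exceeds both inputs
    return below_05 / n, above_min / n, above_max / n

rate_05, frac_above_min, frac_above_max = simulate_null()
print(rate_05, frac_above_min, frac_above_max)
```

The first number should sit near 0.05, as uniformity requires, and that is only possible because the combined value exceeds one or both inputs in a sizeable share of draws.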

It's not even true that if both null hypotheses are false the combined $p$-value is smaller than the two separate $p$-values (if this were true, it would imply an unrealistically effective way to tell whether the null hypotheses were false).

What is true is much weaker. If null hypothesis 1 is false by a sufficiently large margin (i.e., if the test has sufficient power), then the combined $p$-value will be smaller than the separate $p$-value for null hypothesis 2 (and vice versa). Working out what 'sufficiently large' means takes more analysis or simulation.
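
One way such a simulation might look is sketched below. The setup is an assumption made purely for illustration: test 1 is a one-sided z-test whose statistic is normal with mean `delta` and unit variance, null hypothesis 2 is true (so its p-value is uniform), and the combined value uses the two-test closed form for Fisher's method. None of the function names come from a library.

```python
import math
import random

def fisher_two(p1, p2):
    """Fisher's combined p-value for two independent p-values (closed form)."""
    prod = p1 * p2
    return prod * (1.0 - math.log(prod))

def upper_tail_normal(z):
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def prob_combined_below_p2(delta, n=50_000, seed=0):
    """Estimate P(combined p-value < p2) for a hypothetical effect size delta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # Test 1: one-sided z-test, true mean shift delta (assumed setup).
        p1 = upper_tail_normal(rng.gauss(delta, 1.0))
        # Test 2: null is true, so the p-value is uniform on (0, 1].
        p2 = 1.0 - rng.random()
        hits += fisher_two(p1, p2) < p2
    return hits / n

for delta in (0.0, 1.0, 2.0, 3.0):
    print(delta, prob_combined_below_p2(delta))
```

In this setup the estimated probability stays below one half when `delta` is 0 and climbs toward 1 as `delta` grows, which is the sense in which 'sufficiently large' depends on the power of test 1.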
