
I have p-values from several tests ($k \approx 50$) that I'd like to combine. I don't have a model of their correlation structure, but they aren't independent. In the worst case, my effective sample size might be more like 15.

I'm dealing with marginally significant results (about 10% remain significant after a two-stage Benjamini-Hochberg FDR correction at α = 0.05). So I need to be careful both to avoid a false positive and to avoid wasting statistical power.

Option 1: Bonferroni argument

The Bonferroni correction is the usual conservative correction for multiple tests in settings where any single positive test is taken as a positive result.

You usually divide your p-value threshold by $k$ for this, which is equivalent to multiplying each p-value by $k$. This is really a linear approximation for $p\ll 1$. The exact quantity you want is $\tilde p = 1-(1-p)^k$, the probability of seeing at least one test pass under the null distribution.
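For example (numbers made up purely for illustration), with $k=50$ and $p=10^{-3}$ the linear approximation and the exact correction are nearly identical:

```python
k, p = 50, 1e-3
print(k * p)            # 0.05    Bonferroni: linear approximation
print(1 - (1 - p)**k)   # ~0.0488 exact "at least one test passes" probability
```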

By this argument, if I think that my effective sample size is no less than $k/3$, I could correct my p-values to $\tilde p = 1-(1-p)^3$, then apply Fisher's method. On my data, this gives something like $p_{\text{combined}} \approx 5\times 10^{-3}$.
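Here's a minimal sketch of what I mean (the p-values below are random placeholders, not my real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvals = rng.uniform(size=50)        # placeholder for my ~50 raw p-values

p_tilde = 1 - (1 - pvals) ** 3      # "worst of 3 correlated copies" correction
fisher_stat = -2 * np.log(p_tilde).sum()
p_combined = stats.chi2.sf(fisher_stat, df=2 * len(p_tilde))
```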

Option 2: Fisher's method argument

My understanding is that Fisher's method (1) converts each p-value to $-2\ln p_i$, which under the null is χ²-distributed with 2 DoF (the sum of squares $\|z_i\|^2$ of two standard normals), (2) sums these over the $k$ tests, then (3) tests whether the result looks significant against a χ² with DoF $= 2k$. On my data, this gives something absurd like $p_{\text{combined}} \approx 1\times 10^{-11}$, which is certainly spurious.
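In code (placeholder p-values again; I believe `scipy.stats.combine_pvalues(pvals, method='fisher')` computes the same thing):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvals = rng.uniform(size=50)        # placeholder p-values

# each -2*ln(p_i) is chi-squared with 2 DoF under the null, so the sum has 2k DoF
fisher_stat = -2 * np.log(pvals).sum()
p_combined = stats.chi2.sf(fisher_stat, df=2 * len(pvals))
```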

My intuition is this: say my data secretly consisted of individual tests replicated three times each. Each of these would give a (possibly noisy) estimate of the same p-value.

I could hypothetically average these together into $k/3$ tests, then apply Fisher's method, testing against a χ² with DoF $= 2k/3$.

Instead of testing whether $\sum_{i=1}^k\|z_i\|^2$ looks significant against a χ² with $2k$ DoF, I could test whether $\left(\sum_{i=1}^k\|z_i\|^2\right)/3$ looks significant against a χ² with $2k/3$ DoF. On my data, this gives $p_{\text{combined}} \approx 1\times 10^{-5}$.
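Concretely, the adjustment would look like this (placeholder p-values; note that `chi2.sf` accepts a non-integer DoF):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
pvals = rng.uniform(size=50)        # placeholder p-values

k = len(pvals)
fisher_stat = -2 * np.log(pvals).sum()
# treat the k tests as k/3 effective tests: scale the statistic and the DoF by 1/3
p_combined = stats.chi2.sf(fisher_stat / 3, df=2 * k / 3)
```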

Option 3: Preprocessing with Benjamini Hochberg

Say I trust the corrected p-values from a two-stage Benjamini-Hochberg FDR correction (α = 0.05). I think this is fine, since this FDR correction depends on the distribution of p-values and doesn't require independence? (Correct me if this is wrong.)

I can run the FDR correction first, then apply Fisher's method to combine the corrected p-values. On my data, this gives $p_{\text{combined}} \approx 2\times 10^{-3}$.
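Something like the following, using statsmodels' two-stage BH correction (`fdr_tsbh`) on placeholder p-values:

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
pvals = rng.uniform(size=50)        # placeholder p-values

# two-stage Benjamini-Hochberg FDR correction
_, p_fdr, _, _ = multipletests(pvals, alpha=0.05, method="fdr_tsbh")

# combine the corrected p-values with Fisher's method
fisher_stat = -2 * np.log(p_fdr).sum()
p_combined = stats.chi2.sf(fisher_stat, df=2 * len(p_fdr))
```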

Question: Are any of these methods considered kosher and, if so, does anyone know a reference to back them up? (Other solutions very welcome, of course).

MRule
  • I don't have a clear picture of what your data are, what your inferential objectives are, and why you would be combining p-values. Can you give a bit of detail? – Michael Lew Feb 18 '23 at 20:21
  • I'm interested in general strategies for combining p-values from tests with unknown positive correlations (but where perhaps loose upper and lower bounds can be provided on these correlations). – MRule Feb 18 '23 at 20:27
  • People commonly combine p-values when they've performed several tests, none of which is individually significant, but which seem to be "almost significant" in some systematic way. This can allow a person to make claims about excess prevalence of the effect in the population. – MRule Feb 18 '23 at 20:29
  • However, the standard methods for this that I've found so far require that the tests be independent. A fair assumption, but not always true. – MRule Feb 18 '23 at 20:30
  • Correlations in time are a common example of why someone would have tests with unknown correlations. Perhaps you perform the same test every day on a piece of equipment, which you suspect is introducing some systematic time-varying biases. Perhaps you're fairly certain these biases will be uncorrelated for samples processed a week apart, but have a suspicion that samples processed on consecutive days might be correlated. – MRule Feb 18 '23 at 20:33
  • Now, you could build a complicated model of the autocorrelation, but that's a difficult open-ended problem that will take time, and you're on a deadline. Instead, you reasonably assume that the tests are noisy measurements of some true underlying effect, but with unknown correlations. You know you can't apply Fisher's method as-is, but you also suspect that things aren't so correlated that combining p-values would be unsafe under some reasonable assumptions. – MRule Feb 18 '23 at 20:35
  • Now, my specific case isn't actually a time series, it's spatiotemporal data, but I introduced the time-series example because I want to keep this as general as possible so that it can help people with problems that seem structurally dissimilar on the surface. – MRule Feb 18 '23 at 20:35
  • The general idea of "What can I do if I want to apply Fisher's method (or some other method of combining p-values), but my p-values have unknown positive correlations" is very clear and sufficiently general to be useful on this forum, I think. Maybe I'll change the title, thanks! – MRule Feb 18 '23 at 20:36
  • If you use R you may want to investigate the CRAN package poolr which has a number of methods. The authors are gradually publishing the results from Cinar's PhD thesis. Full disclosure: I was on the panel for his thesis defence. – mdewey Feb 19 '23 at 14:15

1 Answer


Combining p-values within a statistical framework is a flourishing research area, mostly in genetics and public health, for example when dealing with meta-analyses.

Basically, p-value combination can be achieved using two kinds of methods: asymptotic and resampling-based approaches. I assume you are comfortable enough with these families of methods to move forward, and I will present two good options for your case (three actually, but the third is just a combination of the first two). I will focus my attention on the ACAT (Liu & Lin, 2019) and minP (Westfall & Young, 1993) procedures, providing more of a practical guide focused on general concepts. I strongly recommend reading the original papers if you want technical details.

Aggregated Cauchy Association Test, a.k.a. ACAT

ACAT is a very recent method, initially proposed by Liu & Xie, 2019. Basically, the method is in the same spirit as Fisher's method, since it is built on a weighted sum of transformed p-values. According to the original paper, the correlation structure does not affect the Type I error rate much, because of the heavy tails of the Cauchy distribution. Another argument in favor of ACAT is that it maximizes power by not adjusting any degrees of freedom of the test (unlike Fisher's method). Finally, thanks to an elegant property of the Cauchy distribution (a weighted sum of standard Cauchy random variables, with weights summing to one, is again standard Cauchy), the combined p-value has a closed form and computation is fast, which is very advantageous in practice.

The method can be implemented in your favorite programming language, but if you are working in R, the method is available here: https://github.com/yaowuliu/ACAT
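If you prefer Python, here is a minimal equal-weight sketch of the Cauchy combination statistic as I understand it (just to show the mechanics; the reference implementation linked above also handles edge cases such as extremely small p-values and explicit weights):

```python
import numpy as np
from scipy import stats

def cauchy_combination(pvals, weights=None):
    """Equal-weight sketch of the ACAT / Cauchy combination test."""
    pvals = np.asarray(pvals, dtype=float)
    if weights is None:
        weights = np.full(len(pvals), 1.0 / len(pvals))
    # map each p-value to a standard Cauchy quantile and take the weighted sum
    t = np.sum(weights * np.tan((0.5 - pvals) * np.pi))
    # with weights summing to one, t is approximately standard Cauchy under the null
    return stats.cauchy.sf(t)

# toy usage on placeholder p-values
rng = np.random.default_rng(0)
print(cauchy_combination(rng.uniform(size=50)))
```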

minP

minP is a resampling procedure. It is a two-stage method that uses the minimal p-value as the test statistic (Dudoit & van der Laan, 2008). Roughly speaking, you resample your data to obtain the null distribution of the test statistic, then derive the combined p-value from it. Two advantages I see here: the resampling respects the correlation structure observed in your original data (even if it is very strong), while choosing the minimal p-value maximizes power. Sandrine Dudoit has written a lot of good papers on standard permutation-based approaches. It is worth noting that in the presence of a large number of p-values the method can be practically infeasible, although adaptations can be used to reach results faster (kurtosis-adapted approaches, for example; Lee et al., 2014).
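To give an idea of the structure in Python (a generic permutation-style sketch, not the Westfall & Young step-down algorithm; `compute_pvals` and `resample` are hypothetical user-supplied functions standing in for your actual analysis and your null-resampling scheme):

```python
import numpy as np

def minp_sketch(data, compute_pvals, resample, n_resamples=1000, seed=0):
    # compute_pvals(data) -> vector of k p-values computed on a dataset
    # resample(data, rng) -> a resampled/permuted dataset drawn under the null
    rng = np.random.default_rng(seed)
    observed_min = np.min(compute_pvals(data))
    null_mins = np.array([np.min(compute_pvals(resample(data, rng)))
                          for _ in range(n_resamples)])
    # combined p-value: how often the null minimum beats the observed minimum
    return (1 + np.sum(null_mins <= observed_min)) / (1 + n_resamples)
```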

Hybrid approach, MinP-CCT-MinP

This is the most recent method. Since I have not worked with it, I will just mention that Chen (2022) showed that minP is more powerful than ACAT under certain correlation structures, such as autoregressive models.

The code is available here: https://github.com/zchen2020/Robust-P-value-combination-tests

Finally, I provide a short guide of good practices when combining p-values.

Good practices

Generally speaking, these approaches will be enough to deal with a large set of problems.

  1. If you have a large number of p-values and you do not expect large correlations across them, ACAT should be a good option. You will obtain fast results with good control in the distribution tail (small p-values).
  2. However, if you do expect strong correlations, ACAT may not be enough. Using resampling-based approaches seems the way to go.

N.B.: More recent methods use saddlepoint approximations to obtain even better control in the distribution tails. To my knowledge, they are not implemented for general purposes, only for very specific applications.

  3. Finally, if you expect specific correlation structures, using the hybrid technique will probably lead to powerful results with good Type I error control.

Anyway, I could illustrate the pros and cons of these methods by simulating some data. Let me know if you are interested.

Hope this helps.

  • It looks like this ACAT thing reduces to a nonlinear averaging of p-values. If $p$ is my vector of p-values, then cauchy(0,len(p)).sf(sum(cauchy.ppf(1-p))) == cauchy(0,1).sf(mean(cauchy.ppf(1-p))) (in python, with from scipy.stats import cauchy). On my data, this gives $\tilde p=0.08$, compared to say... taking the harmonic mean of the p-values (which has been suggested elsewhere), which gives $\tilde p = 0.05$. I guess my question is: you mention not tweaking the DoF with ACAT, but as far as I can tell, ACAT is invariant to DoF (unlike Fisher's method?). What did I miss? – MRule Feb 20 '23 at 12:46
  • @Glen_b You're right, I fixed it in my answer. Here we are indeed working with random variables (transformed p-values) from a standard Cauchy distribution. Thanks for the clarification. – Mangnier Loïc Feb 20 '23 at 13:05
  • @MRule Like you correctly said, ACAT is DoF invariant. One advantage I see here is that you only need to account for the DoF once (when computing the individual p-values). If I can provide some context, in genetics our models commonly produce χ² statistics with 1 DoF. Using ACAT is advantageous here: you get more power by not adjusting the DoF. Indeed, increasing the DoF reduces power in many cases. Did I answer your question? – Mangnier Loïc Feb 20 '23 at 13:14
  • I'm not sure, but it's good to know that I don't need to attend to DoF for the ACAT method. In my experience Cauchy tails are often a bit too heavy and will leak a bit too much power, but it makes sense as a conservative way to average them. I'll report to my collaborators that no amount of averaging is going to conjure significance :P. Now, I'm not sure why genomics would have a χ² with 1 DoF, but that's another question for another time. As an aside, the way stats is taught in undergrad/grad school appears to be fantastically useless for understanding statisticians' language in real life! – MRule Feb 20 '23 at 13:32