Having found nice formulas for testing the null hypothesis under exponentially-distributed samples, I wanted to see how well permutation tests could do the job. And the answer, assuming no mistakes, appears to be: very poorly! Are permutation tests really this weak, or is something wrong here?
The challenge: Given two sample sets {x} and {y} from exponentially distributed variables, compute the p-value under the null hypothesis $H_0$ (i.e., p = the probability of observing those samples if both sample sets were drawn from the same exponential distribution).
As described in the answer linked above, we have two good test statistics for this:
- An F statistic: F = $\frac{\bar{x}}{\bar{y}} \sim F(2n_x, 2n_y)$
- A Likelihood Ratio test statistic L = $\frac{\mathcal{L}(H_0)}{\mathcal{L}(H_1)}$
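(For reference, here is a sketch of what I believe the LRT statistic works out to for exponential samples, after plugging in the MLEs $\hat{\mu}_x = \bar{x}$, $\hat{\mu}_y = \bar{y}$ and the pooled mean $\bar{z}$ under $H_0$; the exponential factors cancel, leaving

$$L = \frac{\mathcal{L}(H_0)}{\mathcal{L}(H_1)} = \frac{\bar{x}^{\,n_x}\,\bar{y}^{\,n_y}}{\bar{z}^{\,n_x+n_y}}, \qquad \bar{z} = \frac{n_x\bar{x}+n_y\bar{y}}{n_x+n_y},$$

so $0 < L \le 1$, with small values pointing away from $H_0$.)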
To test things out I chose to look at samples of size 9. I generated the sampling distribution for L under the null hypothesis via simulation, using the following code:
import numpy as np
from scipy.stats import permutation_test, percentileofscore, expon, f
def LRTstat(x, y):
    """Likelihood Ratio Test statistic for exponential distribution"""
    mx = np.mean(x)
    my = np.mean(y)
    m0 = np.mean(np.concatenate((x, y)))
    # Likelihood of these samples under null hypothesis of equal means:
    L0 = np.prod(expon.pdf(np.concatenate((x, y)), scale=m0))
    # Likelihood of these samples under the alternative hypothesis of different means:
    L1 = np.prod(expon.pdf(x, scale=mx)) * np.prod(expon.pdf(y, scale=my))
    return L0 / L1

nx, ny = (9, 9)
nsim = 10_000
L = np.zeros(nsim)  # Likelihood ratio simulations
for i in range(nsim):
    x = np.random.exponential(size=nx)
    y = np.random.exponential(size=ny)
    L[i] = LRTstat(x, y)
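As a quick sanity check on that simulated null distribution (a sketch of my own, not part of the original code): p values obtained by ranking fresh null samples against L should be roughly uniform on [0, 1], so their mean should be near 0.5:

# Sanity check (sketch): percentile-based p values should be ~Uniform(0, 1) under H0
null_p = np.array([
    percentileofscore(L, LRTstat(np.random.exponential(size=nx),
                                 np.random.exponential(size=ny))) / 100
    for _ in range(1_000)
])
print(f'Mean of null p values: {null_p.mean():.2f} (expect ~0.5)')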
Now for any sample, I compute the p values using both statistics as follows:
# Generate random samples
x = np.random.exponential(size=nx)
y = np.random.exponential(size=ny, scale=0.5) # Violate null hypothesis to make it interesting
mx = np.mean(x)
my = np.mean(y)
Fstat = mx/my
F_CDF = f.cdf(Fstat, nx*2, ny*2)
F_p = 2*min(F_CDF, 1-F_CDF)
LRT_p = percentileofscore(L, LRTstat(x, y))/100
print(f'Samples ({nx}) MeanX: {np.mean(x):.2f}\t({ny}) MeanY: {np.mean(y):.2f}\n'
      f'F stat: {Fstat:.1f};\tCDF={F_CDF:.1%};\tp={F_p:.1%}\n'
      f'LRT Statistic: {LRTstat(x, y):.3f};\tp={LRT_p:.1%}')
So a typical result from that code is:
Samples (9) MeanX: 1.26 (9) MeanY: 0.36
F stat: 3.5; CDF=99.5%; p=1.1%
LRT Statistic: 0.036; p=1.3%
The difference in p values from the two methods is never greater than 1%.
Now, for the same sample data, I run a permutation test on the L statistic using the scipy implementation. I give it 10,000 iterations, which is 20% of the permutation space in this case (9+9 samples). Here is the code:
res = permutation_test((x, y), LRTstat, vectorized=False,
                       n_resamples=10_000, alternative='two-sided')
print(f'Permutation p={res.pvalue:.1%}')
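For comparison (this is my own sketch, not part of the original code), the one-sided permutation p value can be computed exactly by enumerating all $\binom{18}{9} = 48{,}620$ ways of splitting the pooled data and counting the fraction of splits whose statistic is at or below the observed value, small L being the extreme direction for this statistic:

from itertools import combinations

def exact_perm_pvalue_less(x, y, stat=LRTstat):
    """One-sided ('less') permutation p value by full enumeration (sketch)."""
    pooled = np.concatenate((x, y))
    n, k = len(pooled), len(x)
    observed = stat(x, y)
    count = total = 0
    for idx in combinations(range(n), k):
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        count += stat(pooled[mask], pooled[~mask]) <= observed
        total += 1
    return count / total

print(f'Exact one-sided permutation p={exact_perm_pvalue_less(x, y):.1%}')

Full enumeration removes the Monte Carlo noise of resampling, so any remaining disagreement with the other tests comes from the statistic and the sidedness, not the number of resamples.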
In this one case it says p=15%, which is a lot more than the <2% given by the other tests! I ran a thousand more tests through the code and the permutation p value is all over the place: Almost always much larger than the other tests for L values < 0.8, and often absurdly so. One scenario with L = 0.14 gave p=96%!
Is this typical or expected performance for a permutation test in this application? Is this a particularly bad application?
ETA: A problem with the SciPy implementation of the two-tailed permutation test?
Studying this further, I discovered that when I switch the scipy.stats.permutation_test alternative parameter from two-sided to less (which is a one-sided test) I get more reasonable results. So it appears that a big part of this question is: what is the SciPy test doing in the two-sided case, and does it make any sense?
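For concreteness, the only change is the alternative argument:

# Same test, one-sided: small L is the extreme direction for this statistic
res_less = permutation_test((x, y), LRTstat, vectorized=False,
                            n_resamples=10_000, alternative='less')
print(f'One-sided permutation p={res_less.pvalue:.1%}')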
Here are two charts from the simulations studying this permutation test implementation. The first is the one that led to this question: Using the code above, I ran 1,000 simulations where {x} is 9 random draws from Exponential(1), and {y} is 9 random draws from Exponential(0.5). For each simulation I record the LRT statistic value and the following three dependent p values:
- F_p – the p from the F test
- LRT_p – the p from the LRT statistic
- permutation_test.pvalue – the p from the SciPy permutation_test()
This first chart shows that the first two values are virtually identical and increase with the LRT statistic, as expected. The third value is shown by the scattered dots and just makes no sense.
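(For reproducibility, here is a sketch of the simulation loop behind these charts, reconstructed from the description above; the variable names are mine, and it is slow because each iteration runs its own 10,000-resample permutation test.)

results = []
for _ in range(1_000):
    x = np.random.exponential(size=nx)
    y = np.random.exponential(scale=0.5, size=ny)
    F_CDF = f.cdf(np.mean(x) / np.mean(y), 2 * nx, 2 * ny)
    F_p = 2 * min(F_CDF, 1 - F_CDF)
    LRT_p = percentileofscore(L, LRTstat(x, y)) / 100
    perm_p = permutation_test((x, y), LRTstat, vectorized=False,
                              n_resamples=10_000,
                              alternative='two-sided').pvalue
    results.append((LRTstat(x, y), F_p, LRT_p, perm_p))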
This next chart is the same experiment done on 100 simulations, but where the permutation_test alternative parameter is changed from two-sided to less, which is a one-sided test. It still gives p values that can be quite far from the correct values, but at least it stays correlated with the correct values.
To reiterate the questions now:
- What is the SciPy permutation_test doing in the alternative='two-sided' case? (And whatever it is, does it make any sense?)
- In the one-sided case, is that level of variation from the correct p value considered reasonable for a permutation test in this sort of scenario? Can it be improved?