
Having found nice formulas for testing the null hypothesis under exponentially-distributed samples, I wanted to see how well permutation tests could do the job. And the answer, assuming no mistakes, appears to be: very poorly! Are permutation tests really this weak, or is something wrong here?

The challenge: Given two sample sets {x} and {y} from exponentially distributed variables, compute the p-value under the null hypothesis $H_0$ (i.e., p = the probability of observing a test statistic at least as extreme as the one computed from these samples, if both sample sets were drawn from the same exponential distribution).

As described in the answer linked above, we have two good test statistics for this:

  1. An F statistic: F = $\frac{\bar{x}}{\bar{y}} \sim F(2n_x, 2n_y)$ (a short derivation sketch follows this list)
  2. A Likelihood Ratio test statistic L = $\frac{\mathcal{L}(H_0)}{\mathcal{L}(H_1)}$
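
For reference, the distribution in statistic 1 follows from the standard fact that if $X_i \sim \text{Exponential}(\theta)$ then $2\sum_i X_i/\theta \sim \chi^2_{2n}$. The line below is my paraphrase of that argument, not a quote from the linked answer; under $H_0$ both samples share a common scale $\theta$, which cancels:

$$\frac{\bar{x}}{\bar{y}} = \frac{\theta\left(2\sum_i x_i/\theta\right)/(2n_x)}{\theta\left(2\sum_j y_j/\theta\right)/(2n_y)} \sim \frac{\chi^2_{2n_x}/(2n_x)}{\chi^2_{2n_y}/(2n_y)} = F(2n_x, 2n_y)$$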

To test things out I chose to look at samples of size 9. I generated the sampling distribution for L under the null hypothesis via simulation, using the following code:

import numpy as np
from scipy.stats import permutation_test, percentileofscore, expon, f

def LRTstat(x, y):
    """Likelihood Ratio Test statistic for exponential distribution"""
    mx = np.mean(x)
    my = np.mean(y)
    m0 = np.mean(np.concatenate((x, y)))
    # Likelihood of these samples under null hypothesis of equal means:
    L0 = np.prod(expon.pdf(np.concatenate((x, y)), scale=m0))
    # Likelihood of these samples under the alternative hypothesis of different means:
    L1 = np.prod(expon.pdf(x, scale=mx)) * np.prod(expon.pdf(y, scale=my))
    return L0 / L1

nx, ny = (9, 9)
nsim = 10_000
L = np.zeros(nsim)  # Likelihood ratio simulations
for i in range(nsim):
    x = np.random.exponential(size=nx)
    y = np.random.exponential(size=ny)
    L[i] = LRTstat(x, y)

Now, for any pair of samples, I compute the p values using both statistics as follows:

# Generate random samples
x = np.random.exponential(size=nx)
y = np.random.exponential(size=ny, scale=0.5)  # Violate null hypothesis to make it interesting
mx = np.mean(x)
my = np.mean(y)
Fstat = mx/my
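# Two-sided F-test p value: double the smaller tail area of the F(2*nx, 2*ny) distribution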
F_CDF = f.cdf(Fstat, nx*2, ny*2)
F_p = 2*min(F_CDF, 1-F_CDF)
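# LRT p value: percentile rank of the observed statistic within the simulated null distribution L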
LRT_p = percentileofscore(L, LRTstat(x, y))/100
print(f'Samples ({nx}) MeanX: {np.mean(x):.2f}\t({ny}) MeanY: {np.mean(y):.2f}\n'
      f'F stat: {Fstat:.1f};\tCDF={F_CDF:.1%};\tp={F_p:.1%}\n'
      f'LRT Statistic: {LRTstat(x, y):.3f};\tp={LRT_p:.1%}')

So a typical result from that code is:

Samples (9) MeanX: 1.26   (9) MeanY: 0.36
F stat: 3.5;    CDF=99.5%;  p=1.1%
LRT Statistic: 0.036;   p=1.3%

The difference in p values from the two methods is never greater than 1%.
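
That comparison is empirical; a minimal sketch of how to check it (my addition, reusing LRTstat, L, nx, and ny from the code above; xi, yi, and max_gap are my own names) is:

max_gap = 0.0
for _ in range(1000):
    xi = np.random.exponential(size=nx)
    yi = np.random.exponential(scale=0.5, size=ny)
    F_CDF_i = f.cdf(np.mean(xi) / np.mean(yi), 2 * nx, 2 * ny)
    F_p_i = 2 * min(F_CDF_i, 1 - F_CDF_i)
    LRT_p_i = percentileofscore(L, LRTstat(xi, yi)) / 100
    max_gap = max(max_gap, abs(F_p_i - LRT_p_i))
print(f'Largest |F_p - LRT_p| over 1,000 sample pairs: {max_gap:.1%}')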

Now, for the same sample data, I run a permutation test on the L statistic using the SciPy implementation. I give it 10,000 resamples, which is about 20% of the permutation space in this case (with 9+9 samples there are $\binom{18}{9} = 48{,}620$ distinct ways to split the pooled data). Here is the code:

res = permutation_test((x, y), LRTstat, vectorized=False,
                       n_resamples=10_000, alternative='two-sided')
print(f'Permutation p={res.pvalue:.1%}')

In this one case it says p=15%, which is a lot more than the <2% given by the other tests! I ran a thousand more tests through the code and the permutation p value is all over the place: Almost always much larger than the other tests for L values < 0.8, and often absurdly so. One scenario with L = 0.14 gave p=96%!
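
One way to probe what the two-sided option is doing (a diagnostic sketch, not a claim about SciPy internals; obs, null, p_less, and p_greater are my own names) is to compare the reported p value against the two tail proportions of the null distribution that permutation_test returns:

obs = LRTstat(x, y)
null = res.null_distribution
p_less = np.mean(null <= obs)     # left-tail proportion (small L is evidence against H0)
p_greater = np.mean(null >= obs)  # right-tail proportion
print(f'less tail: {p_less:.1%}   greater tail: {p_greater:.1%}   '
      f'reported two-sided p: {res.pvalue:.1%}')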

Is this typical or expected performance for a permutation test in this application? Is this a particularly bad application?


ETA: A problem with the SciPy implementation of the two-tailed permutation test?

Studying this further, I discovered that when I switch the scipy.stats.permutation_test alternative parameter from two-sided to less (which is a one-sided test) I get more reasonable results. So it appears that a big part of this question is: what is the SciPy test doing in the two-sided case, and does it make any sense?
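
For concreteness, the one-sided variant is the same call with only the alternative argument changed (res_less is just my name for the result):

res_less = permutation_test((x, y), LRTstat, vectorized=False,
                            n_resamples=10_000, alternative='less')
print(f'One-sided permutation p={res_less.pvalue:.1%}')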

Here are two charts from the simulations studying this permutation test implementation. The first is the one that led to this question: using the code above, I ran 1,000 simulations where {x} is 9 random draws from Exponential(1) and {y} is 9 random draws from Exponential(0.5). For each simulation I record the LRT statistic value and the following three p values:

  1. F_p – the p from the F test
  2. LRT_p – the p from the LRT statistic
  3. permutation_test.pvalue – the p from the SciPy permutation_test()

This first chart shows that the first two values are virtually identical and increase with the LRT statistic, as expected. The third value is shown by the scattered dots and just makes no sense.

Chart comparing p-values from good tests with the p-value from scipy two-sided permutation test
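
For reference, a compact sketch of how that experiment can be reproduced (my reconstruction, not the original plotting code; matplotlib, the reduced simulation count, and the names n_study, rows, stats, etc. are my choices):

import matplotlib.pyplot as plt

n_study = 200  # the charts used 1,000 simulations; reduced here because the inner permutation test is slow
rows = []
for _ in range(n_study):
    xi = np.random.exponential(size=nx)
    yi = np.random.exponential(scale=0.5, size=ny)
    stat_i = LRTstat(xi, yi)
    F_CDF_i = f.cdf(np.mean(xi) / np.mean(yi), 2 * nx, 2 * ny)
    perm_p_i = permutation_test((xi, yi), LRTstat, vectorized=False,
                                n_resamples=10_000, alternative='two-sided').pvalue
    rows.append((stat_i, 2 * min(F_CDF_i, 1 - F_CDF_i),
                 percentileofscore(L, stat_i) / 100, perm_p_i))

stats, F_ps, LRT_ps, perm_ps = map(np.array, zip(*rows))
plt.plot(stats, F_ps, '.', label='F test')
plt.plot(stats, LRT_ps, '.', label='LRT sampling distribution')
plt.plot(stats, perm_ps, '.', label='permutation_test (two-sided)')
plt.xlabel('LRT statistic')
plt.ylabel('p value')
plt.legend()
plt.show()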

This next chart is the same experiment, run on 100 simulations, but with the permutation_test alternative parameter changed from two-sided to less, which is a one-sided test. It still gives p values that can be quite far from the correct values, but at least they stay correlated with the correct values.

Chart comparing p-values from good tests with the p-value from scipy one-sided permutation test

To reiterate the questions now:

  1. What is the SciPy permutation_test doing in the alternative = two-sided case? (And whatever it is, does it make any sense?)

  2. In the one-sided case, is that level of variation from the correct p value considered reasonable for a permutation test in this sort of scenario? Can it be improved?

  • Irrelevant to your statistical question, note that your function LRTstat is not compliant with PEP 8. – Galen Dec 03 '23 at 02:05
  • I'm not sure I follow how you're getting the critical value for your likelihood ratio test. I'd also note that I'd be dealing with the F differently from how you have it, though for the case where n1=n2, it won't make a difference. – Glen_b Dec 03 '23 at 10:14
  • @Glen_b I'm still working through understanding the answers here to develop a better two-tailed p-value from the F test. Are there other things I should look at on that topic? – feetwet Dec 03 '23 at 18:51
  • @Glen_b For the LRT: I Monte Carlo the sampling distribution (the # Likelihood ratio simulations section of the code) and then look up the rank of the observed LRT stat in those simulations (percentileofscore(L, LRTstat(x, y))). I thought this was essentially how you did it here, but please tell me if I've misunderstood or if there are better approaches. – feetwet Dec 03 '23 at 18:51
  • Please ignore my above comments for the moment; I inadvertently focused on everything but the central question. Once I've looked at what you actually need looked at, I may end up deleting the comment altogether. – Glen_b Dec 03 '23 at 23:08
  • AH! I think I see your problem. The critical region for the likelihood ratio test statistic is always one-tailed (reject for small LRs), as discussed in my answer to the first post in this sequence of posts (on the Rayleigh): "You reject when the likelihood under $H_0$ is small relative to the likelihood under $H_1$ – i.e. for small values of the likelihood ratio" ... permutation distribution or not, if you also reject for large values, you will get a terrible test, because that's the best case for $H_0$. Keep in mind that to a rough approximation $-2\log \Lambda$ behaves like a chi-squared goodness-of-fit statistic. – Glen_b Dec 04 '23 at 02:17
  • @Glen_b Good, that explains the glaring problem! What about the more subtle problem: IIUC, a permutation test on a likelihood ratio is a UMP test, but in the 100 one-tail simulations I did (last chart) we see plenty of samples on which it gives much lower probability of rejecting the null than the F- and sampling-distribution tests. – feetwet Dec 04 '23 at 22:05
  • 1. It's not UMP. 2. Indeed, there isn't a UMP in the two-tailed case. 3. However, the permutation test should have asymptotic relative efficiency 1 (relative to the parametric test based on the same statistic, in the Pitman-efficiency sense). So in large samples, power against small effects (where you need it) should be essentially the same; it might not be attained when n=9. Individual-sample p values may not be similar between the permutation distribution and the exact parametric distribution; it's the overall rejection rate you would compare at some alternative(s). – Glen_b Dec 05 '23 at 01:26