
Having found nice formulas for testing the null hypothesis under exponentially-distributed samples, I wanted to see how well permutation tests could do the job. And the answer, assuming no mistakes, appears to be: very poorly! Are permutation tests really this weak, or is something wrong here?

The challenge: Given two sample sets {x} and {y} from exponentially distributed variables, compute the p-value under the null hypothesis $H_0$ (i.e., p = the probability of observing a test statistic at least as extreme as the one computed from these samples, if both sample sets were drawn from the same exponential distribution).

As described in the answer linked above, we have two good test statistics for this:

  1. An F statistic: F = $\frac{\bar{x}}{\bar{y}} \sim F(2n_x, 2n_y)$ (a short derivation sketch follows this list)
  2. A Likelihood Ratio test statistic L = $\frac{\mathcal{L}(H_0)}{\mathcal{L}(H_1)}$
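
For reference, the distribution in statistic 1 follows from the standard fact that if $X_i \sim \text{Exponential}(\theta)$ then $2\sum_i X_i/\theta \sim \chi^2_{2n}$. The line below is my paraphrase of that argument, not a quote from the linked answer; under $H_0$ both samples share a common scale $\theta$, which cancels:

$$\frac{\bar{x}}{\bar{y}} = \frac{\theta\left(2\sum_i x_i/\theta\right)/(2n_x)}{\theta\left(2\sum_j y_j/\theta\right)/(2n_y)} \sim \frac{\chi^2_{2n_x}/(2n_x)}{\chi^2_{2n_y}/(2n_y)} = F(2n_x, 2n_y)$$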

To test things out I chose to look at samples of size 9. I generated the sampling distribution for L under the null hypothesis via simulation, using the following code:

import numpy as np
from scipy.stats import permutation_test, percentileofscore, expon, f

def LRTstat(x, y):
    """Likelihood Ratio Test statistic for exponential distribution"""
    mx = np.mean(x)
    my = np.mean(y)
    m0 = np.mean(np.concatenate((x, y)))
    # Likelihood of these samples under null hypothesis of equal means:
    L0 = np.prod(expon.pdf(np.concatenate((x, y)), scale=m0))
    # Likelihood of these samples under the alternative hypothesis of different means:
    L1 = np.prod(expon.pdf(x, scale=mx)) * np.prod(expon.pdf(y, scale=my))
    return L0 / L1

nx, ny = (9, 9)
nsim = 10_000
L = np.zeros(nsim)  # Likelihood ratio simulations
for i in range(nsim):
    x = np.random.exponential(size=nx)
    y = np.random.exponential(size=ny)
    L[i] = LRTstat(x, y)

Now, for any pair of samples, I compute the p values using both statistics as follows:

# Generate random samples
x = np.random.exponential(size=nx)
y = np.random.exponential(size=ny, scale=0.5)  # Violate null hypothesis to make it interesting
mx = np.mean(x)
my = np.mean(y)
Fstat = mx/my
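# Two-sided F-test p value: double the smaller tail area of the F(2*nx, 2*ny) distribution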
F_CDF = f.cdf(Fstat, nx*2, ny*2)
F_p = 2*min(F_CDF, 1-F_CDF)
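# LRT p value: percentile rank of the observed statistic within the simulated null distribution L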
LRT_p = percentileofscore(L, LRTstat(x, y))/100
print(f'Samples ({nx}) MeanX: {np.mean(x):.2f}\t({ny}) MeanY: {np.mean(y):.2f}\n'
      f'F stat: {Fstat:.1f};\tCDF={F_CDF:.1%};\tp={F_p:.1%}\n'
      f'LRT Statistic: {LRTstat(x, y):.3f};\tp={LRT_p:.1%}')

So a typical result from that code is:

Samples (9) MeanX: 1.26   (9) MeanY: 0.36
F stat: 3.5;    CDF=99.5%;  p=1.1%
LRT Statistic: 0.036;   p=1.3%

The difference in p values from the two methods is never greater than 1%.
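
That comparison is empirical; a minimal sketch of how to check it (my addition, reusing LRTstat, L, nx, and ny from the code above; xi, yi, and max_gap are my own names) is:

max_gap = 0.0
for _ in range(1000):
    xi = np.random.exponential(size=nx)
    yi = np.random.exponential(scale=0.5, size=ny)
    F_CDF_i = f.cdf(np.mean(xi) / np.mean(yi), 2 * nx, 2 * ny)
    F_p_i = 2 * min(F_CDF_i, 1 - F_CDF_i)
    LRT_p_i = percentileofscore(L, LRTstat(xi, yi)) / 100
    max_gap = max(max_gap, abs(F_p_i - LRT_p_i))
print(f'Largest |F_p - LRT_p| over 1,000 sample pairs: {max_gap:.1%}')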

Now, for the same sample data, I run a permutation test on the L statistic using the SciPy implementation. I give it 10,000 resamples, which is about 20% of the permutation space in this case (with 9+9 samples there are $\binom{18}{9} = 48{,}620$ distinct ways to split the pooled data). Here is the code:

res = permutation_test((x, y), LRTstat, vectorized=False,
                       n_resamples=10_000, alternative='two-sided')
print(f'Permutation p={res.pvalue:.1%}')

In this one case it says p=15%, which is a lot more than the <2% given by the other tests! I ran a thousand more tests through the code and the permutation p value is all over the place: Almost always much larger than the other tests for L values < 0.8, and often absurdly so. One scenario with L = 0.14 gave p=96%!
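
One way to probe what the two-sided option is doing (a diagnostic sketch, not a claim about SciPy internals; obs, null, p_less, and p_greater are my own names) is to compare the reported p value against the two tail proportions of the null distribution that permutation_test returns:

obs = LRTstat(x, y)
null = res.null_distribution
p_less = np.mean(null <= obs)     # left-tail proportion (small L is evidence against H0)
p_greater = np.mean(null >= obs)  # right-tail proportion
print(f'less tail: {p_less:.1%}   greater tail: {p_greater:.1%}   '
      f'reported two-sided p: {res.pvalue:.1%}')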

Is this typical or expected performance for a permutation test in this application? Is this a particularly bad application?


ETA: A problem with the SciPy implementation of the two-tailed permutation test?

Studying this further, I discovered that when I switch the scipy.stats.permutation_test alternative parameter from two-sided to less (which is a one-sided test) I get more reasonable results. So it appears that a big part of this question is: what is the SciPy test doing in the two-sided case, and does it make any sense?
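
For concreteness, the one-sided variant is the same call with only the alternative argument changed (res_less is just my name for the result):

res_less = permutation_test((x, y), LRTstat, vectorized=False,
                            n_resamples=10_000, alternative='less')
print(f'One-sided permutation p={res_less.pvalue:.1%}')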

Here are two charts from the simulations studying this permutation test implementation. The first is the one that led to this question: using the code above, I ran 1,000 simulations where {x} is 9 random draws from Exponential(1) and {y} is 9 random draws from Exponential(0.5). For each simulation I record the LRT statistic value and the following three p values:

  1. F_p – the p from the F test
  2. LRT_p – the p from the LRT statistic
  3. permutation_test.pvalue – the p from the SciPy permutation_test()

This first chart shows that the first two values are virtually identical and increase with the LRT statistic, as expected. The third value is shown by the scattered dots and just makes no sense.

Chart comparing p-values from good tests with the p-value from scipy two-sided permutation test
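
For reference, a compact sketch of how that experiment can be reproduced (my reconstruction, not the original plotting code; matplotlib, the reduced simulation count, and the names n_study, rows, stats, etc. are my choices):

import matplotlib.pyplot as plt

n_study = 200  # the charts used 1,000 simulations; reduced here because the inner permutation test is slow
rows = []
for _ in range(n_study):
    xi = np.random.exponential(size=nx)
    yi = np.random.exponential(scale=0.5, size=ny)
    stat_i = LRTstat(xi, yi)
    F_CDF_i = f.cdf(np.mean(xi) / np.mean(yi), 2 * nx, 2 * ny)
    perm_p_i = permutation_test((xi, yi), LRTstat, vectorized=False,
                                n_resamples=10_000, alternative='two-sided').pvalue
    rows.append((stat_i, 2 * min(F_CDF_i, 1 - F_CDF_i),
                 percentileofscore(L, stat_i) / 100, perm_p_i))

stats, F_ps, LRT_ps, perm_ps = map(np.array, zip(*rows))
plt.plot(stats, F_ps, '.', label='F test')
plt.plot(stats, LRT_ps, '.', label='LRT sampling distribution')
plt.plot(stats, perm_ps, '.', label='permutation_test (two-sided)')
plt.xlabel('LRT statistic')
plt.ylabel('p value')
plt.legend()
plt.show()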

This next chart is the same experiment, run on 100 simulations, but with the permutation_test alternative parameter changed from two-sided to less, which is a one-sided test. It still gives p values that can be quite far from the correct values, but at least they stay correlated with the correct values.

Chart comparing p-values from good tests with the p-value from scipy one-sided permutation test

To reiterate the questions now:

  1. What is the SciPy permutation_test doing in the alternative = two-sided case? (And whatever it is, does it make any sense?)

  2. In the one-sided case, is that level of variation from the correct p value considered reasonable for a permutation test in this sort of scenario? Can it be improved?

  • Irrelevant to your statistical question, note that your function LRTstat is not compliant with PEP 8. – Galen Dec 03 '23 at 02:05
  • I'm not sure I follow how you're getting the critical value for your likelihood ratio test. I'd also note that I'd be dealing with the F differently from how you have it, though for the case where n1=n2, it won't make a difference. – Glen_b Dec 03 '23 at 10:14
  • @Glen_b I'm still working through understanding the answers here to develop a better two-tailed p-value from the F test. Are there other things I should look at on that topic? – feetwet Dec 03 '23 at 18:51
  • @Glen_b For the LRT: I Monte Carlo the sampling distribution (the # Likelihood ratio simulations section of the code) and then look up the rank of the observed LRT stat in those simulations (percentileofscore(L, LRTstat(x, y))). I thought this was essentially how you did it here, but please tell me if I've misunderstood or if there are better approaches. – feetwet Dec 03 '23 at 18:51
  • Please ignore my above comments for the moment; I inadvertently focused on everything but the central question. Once I've looked at what you actually need looked at, I may end up deleting the comment altogether. – Glen_b Dec 03 '23 at 23:08
  • AH! I think I see your problem. The critical region for the likelihood ratio test statistic is always one-tailed (reject for small LRs), as discussed in my answer to the first post in this sequence of posts (on the Rayleigh): "You reject when the likelihood under $H_0$ is small relative to the likelihood under $H_1$ – i.e. for small values of the likelihood ratio" ... permutation distribution or not, if you also reject for large values, you will get a terrible test, because that's the best case for $H_0$. Keep in mind that to a rough approximation $-2\log \Lambda$ behaves like a chi-squared goodness-of-fit statistic. – Glen_b Dec 04 '23 at 02:17
  • @Glen_b Good, that explains the glaring problem! What about the more subtle problem: IIUC, a permutation test on a likelihood ratio is a UMP test, but in the 100 one-tail simulations I did (last chart) we see plenty of samples on which it gives much lower probability of rejecting the null than the F- and sampling-distribution tests. – feetwet Dec 04 '23 at 22:05
  • 1. It's not UMP. 2. Indeed, there isn't a UMP in the two-tailed case. 3. However, the permutation test should have asymptotic relative efficiency 1 (relative to the parametric test based on the same statistic, in the Pitman-efficiency sense). So in large samples, power against small effects (where you need it) should be essentially the same; it might not be attained when n=9. Individual-sample p values may not be similar between the permutation distribution and the exact parametric distribution; it's the overall rejection rate you would compare at some alternative(s). – Glen_b Dec 05 '23 at 01:26