Permutation Analysis using unequal samples

Question

I need to conduct a permutation analysis on two samples of unequal size: one (X) with 579 data points and another (Y) with 1289 data points. There is an obvious difference in the data. However, this difference may be due to the imbalance between sample sizes. I need to run a permutation analysis using all possible pairs (combinations) of these data points to see how likely it is that these differences are real and not an artifact of the imbalance. The results need to give me p-values to interpret for significance testing. I am very new to R. Also, almost everything I have found deals with paired samples. Does anyone have code or suggestions for doing this?

Thanks.

What exactly are you comparing between the two groups? What is this "obvious difference" and how are you measuring it? Also, please be careful with your terminology, because a "pair" and a "combination" and a "permutation" are all very different things. A pair literally is just two things; a combination is a subset of things; and a permutation is a re-ordering of all of them. Because you seem to have used these three terms as if they were equivalent, please consider editing this question to clear up that ambiguity. — whuber, Jun 27 '14 at 17:43

Glen_b · Answer 1 · 2014-06-28T06:48:44.047

Your data are apparently not paired. You should not be attempting to pair unpaired data.

With (presumably) independent samples, the usual form of permutation test simply permutes the group labels.
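As a minimal sketch of what "permuting the group labels" means, the snippet below shuffles the labels once and recomputes the difference in means. The values are the ones from the worked example in this answer; the helper `mean_diff` and the seed are only illustrative choices.

```r
# Values from the worked example in this answer; labels mark group membership.
a <- c(23.194, 28.027, 37.487, 31.180, 30.430, 38.424)
b <- c(34.623, 32.936, 36.885)
values <- c(a, b)
labels <- rep(c("A", "B"), times = c(length(a), length(b)))

# Difference in means (B minus A) for a given labelling (illustrative helper).
mean_diff <- function(v, g) mean(v[g == "B"]) - mean(v[g == "A"])

observed <- mean_diff(values, labels)

set.seed(1)                  # arbitrary seed, only for reproducibility
shuffled <- sample(labels)   # one random relabelling; group sizes stay 6 and 3
one_draw <- mean_diff(values, shuffled)
```

Repeating the shuffle many times and collecting `one_draw` builds up the reference distribution against which `observed` is compared.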

With your sample sizes, a full permutation test would usually be impractical (unless the sample difference is fairly extreme, in which case a complete enumeration of the tail may be feasible).
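To get a feel for why full enumeration is out of reach here, one can count the distinct ways of choosing which 579 of the 1868 pooled points get the smaller sample's label. This is only a rough sketch: `lchoose` is used because the count itself is far too large for double precision.

```r
n_x <- 579
n_y <- 1289

# log10 of the number of distinct relabellings, choose(1868, 579)
log10_count <- lchoose(n_x + n_y, n_x) / log(10)

# for comparison, two samples of sizes 3 and 6 give a count small enough to enumerate
small_count <- choose(9, 3)
```

`log10_count` is the approximate number of decimal digits in the count, which runs to several hundred, whereas `small_count` is just 84.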

As such we'd usually be looking at a randomization test.

In the independent samples case, when testing for a difference in means, the test statistic can be any statistic yielding equivalent p-values (i.e. one that orders the possible samples in the same way as the difference in means does).

The sum of the values in the smaller sample would be sufficient (the difference in means is a simple linear transformation of this).
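To spell out that linear relationship (the symbols here are my own notation, not anything from the question): let $S$ be the sum of the smaller sample, $T$ the sum of all the pooled values, and $n_S$ and $n_L$ the sizes of the smaller and larger samples. Then

$$
\bar{x}_S - \bar{x}_L \;=\; \frac{S}{n_S} - \frac{T - S}{n_L} \;=\; \left(\frac{1}{n_S} + \frac{1}{n_L}\right) S \;-\; \frac{T}{n_L}.
$$

Since $T$, $n_S$ and $n_L$ are the same for every relabelling, the difference in means is an increasing linear function of $S$, so both statistics rank the relabellings identically and give the same p-values.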

So the randomization test would consist of selecting random sets of 579 values from the combined sample of (579+1289) points and computing their sum, and then locating the sample value in that distribution and identifying the proportion of statistics at least as extreme as the observed one - counting the observed one.
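A rough sketch of that procedure for samples of the sizes in the question follows. The data are simulated stand-ins (the real X and Y are not available here), and the function name `randomization_p_upper`, the seed, and the number of resamples are all just illustrative assumptions.

```r
# Upper-tail randomization p-value based on the sum of the smaller sample.
# x: smaller sample, y: larger sample, n_rand: number of random relabellings.
randomization_p_upper <- function(x, y, n_rand = 10000) {
  pooled <- c(x, y)
  observed <- sum(x)
  # sums of random subsets of the pooled data, each the size of the smaller sample
  sims <- replicate(n_rand, sum(sample(pooled, length(x))))
  sims <- c(sims, observed)   # count the observed statistic as one of the draws
  mean(sims >= observed)      # proportion at least as large as the observed sum
}

set.seed(42)                   # arbitrary seed, only for reproducibility
x <- rnorm(579, mean = 10.5)   # hypothetical stand-in for sample X
y <- rnorm(1289, mean = 10)    # hypothetical stand-in for sample Y
p_upper <- randomization_p_upper(x, y)
```

Because the observed sum is included among the draws, the smallest p-value this can return is 1/(n_rand + 1) rather than zero.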

For a two-tailed test, compute the sum for the smaller sample that would correspond to a difference in means of the same size as the observed one but of opposite sign, and add the proportion of statistics at least as extreme as that value in the other tail.
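In case it helps, here is how that mirrored sum can be worked out (again in my own notation): with $S$ the observed sum of the smaller sample, $n_S$ its size, $n_L$ the size of the larger sample, $T$ the pooled total and $\bar{x}$ the pooled mean, the sum $S'$ giving a mean difference of opposite sign satisfies

$$
\left(\frac{1}{n_S} + \frac{1}{n_L}\right) S' - \frac{T}{n_L} \;=\; -\left[\left(\frac{1}{n_S} + \frac{1}{n_L}\right) S - \frac{T}{n_L}\right]
\quad\Longrightarrow\quad
S' \;=\; \frac{2\, n_S\, T}{n_S + n_L} - S \;=\; 2\, n_S\, \bar{x} - S .
$$

With $n_S = 3$ this is $6\bar{x} - 3\bar{x}_S = 3\,(2\bar{x} - \bar{x}_S)$, which is the lower-tail cutoff `3*(2*mean(alldata)-mean(b))` used in the code in this answer.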

The independent-samples randomization test is a pretty standard test. In R you should be able to do it using the coin package, but writing code for it is pretty simple (though possibly a good deal slower than using a function in a well-built package).
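For what it's worth, a call along the following lines is how I'd expect this to look with coin. Treat it as a sketch under assumptions: it uses the small example data from this answer, it assumes a reasonably recent coin version in which `approximate()` takes `nresample` (older versions used a differently named argument), and it is guarded so that it does nothing if coin is not installed.

```r
# Small example data from this answer, put into the long format coin expects
a <- c(23.194, 28.027, 37.487, 31.180, 30.430, 38.424)
b <- c(34.623, 32.936, 36.885)
dat <- data.frame(
  value = c(a, b),
  group = factor(rep(c("A", "B"), times = c(length(a), length(b))))
)

if (requireNamespace("coin", quietly = TRUE)) {
  set.seed(1)   # arbitrary seed; the approximate distribution is simulated
  fit <- coin::oneway_test(value ~ group, data = dat,
                           distribution = coin::approximate(nresample = 10000))
  p_coin <- coin::pvalue(fit)   # two-sided by default
}
```

The grouping variable needs to be a factor, and unequal group sizes are handled without any special treatment.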

Consider the following data (this is small enough to do complete enumeration of the permutation distribution, but we'll do it as a randomization test):

Sample A: 23.194 28.027 37.487 31.180 30.430 38.424

Sample B: 34.623 32.936 36.885

 a <- scan()
 23.194 28.027 37.487 31.180 30.430 38.424

 b <- scan()
 34.623 32.936 36.885 

 (mean.diff <- mean(b)-mean(a))
 [1] 3.357667

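 sumb <- sum(b)            # observed statistic: sum of the smaller sample
 alldata <- c(a,b)         # pooled data

 # sums of 10000 random subsets of size 3 (i.e. random relabellings)
 res <- replicate(10000,sum(sample(alldata,3)))

 res <- c(res,sumb)        # count the observed statistic as one of the draws
 extreme1 <- sum(res >= sumb)
 # lower-tail cutoff: the sum giving a mean difference of opposite sign
 extreme2 <- sum(res >= sumb)+sum(res <= 3*(2*mean(alldata)-mean(b)))
 p.value.1 <- extreme1/length(res)   # 10001 values, since res includes the observed one
 p.value.2 <- extreme2/length(res)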

In my example it gives an (upper) one-tailed p-value of about $0.19$ and a two-tailed p-value of about $0.37$. Being based on random resampling, these will vary slightly from run to run.

You'd have to make a few adjustments for your sample data but that's the gist of how it works.

The exact permutation test p-value in this case is computed thusly:

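 # sums of all choose(9,3) = 84 possible subsets of size 3
 permsum = combn(alldata,3,sum)
 pp.value.1 = sum(sumb<=permsum)/length(permsum)
 pp.value.2 = pp.value.1+sum(3*(2*mean(alldata)-mean(b))>=permsum)/length(permsum)
 # permutation distribution, with the observed sum and its mirror image marked
 hist(permsum,n=50)
 abline(v=sumb,lty=2,col=6)
 abline(v=3*(2*mean(alldata)-mean(b)),lty=2,col=8)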

The corresponding exact 1- and 2- tailed p-values here are $0.19$ and $0.37$.
