Estimating a p-value when you can't compute it for the whole set

Question

I have two large lists and I want to compute the p-value for whether they are similar. The similarity algorithm is a black box, but for this application we will trust that it gives accurate p-values. My problem is that my lists are too large and the algorithm won't give an answer for them. However, I think that if I take a smaller random sample of each list, the algorithm will give a p-value. Is it permissible to randomly sample both lists a large number of times and average the p-values?

Should I randomly sample with or without replacement?

Your problem definition needs more clarity. What are you testing for between your variables (the large lists)? Do you want to test for equality of population means? If the "similarity" check is a black box, how do you know which statistical test to perform? In short, I think the hypothesis has to be specified first, and it has to be specific about what you are testing for. Only then does the question of implementation arise. — Arun, Oct 18 '11 at 12:33
Does the algorithm really output a p-value alone and nothing else? If it also gives you estimates or some other statistics, that might make things easier, as there are ways of combining such results from a set of 'batches'. — onestop, Oct 18 '11 at 16:53
@onestop, Arun: Basically, it's a proprietary algorithm that gives a p-value for similarity. I would like to assume that the algorithm has an appropriate null hypothesis and that it has an accurate way of placing whatever test statistic it generates in the distribution of test statistics (for the null hypothesis) and thus is able to give an accurate p-value so far as it goes (if this makes any sense). Assume also that this test statistic distribution is not normal (and probably empirically constructed with no closed form) ... is the answer to my original question a yes or a no? — Henry B., Oct 19 '11 at 01:36

score 4 · Accepted Answer · answered Oct 20 '11 at 10:09

It's not correct to randomly sample both lists a large number of times and average the p-values. If the null hypothesis is false, you expect the p-value to get smaller as the sample size gets larger, but the average of p-values from fixed-size subsamples stays the same on average no matter how many subsamples you draw. The result would therefore understate the evidence against the null hypothesis.
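To illustrate the point, here is a small simulation sketch. The black box is not available, so `z_test_pvalue` is a hypothetical stand-in (a two-sided two-sample z-test that assumes unit variance), and the list sizes, effect size, and subsample size are made up for the demonstration.

```python
import math
import random


def z_test_pvalue(xs, ys):
    """Hypothetical stand-in for the black box: two-sided two-sample
    z-test for equal means, assuming both lists have variance 1."""
    n, m = len(xs), len(ys)
    diff = sum(xs) / n - sum(ys) / m
    z = diff / math.sqrt(1.0 / n + 1.0 / m)
    # Two-sided tail probability of a standard normal.
    return math.erfc(abs(z) / math.sqrt(2.0))


def mean_subsample_pvalue(a, b, size, repeats, rng):
    """Average the p-values of `repeats` random subsamples of `size`."""
    total = 0.0
    for _ in range(repeats):
        total += z_test_pvalue(rng.sample(a, size), rng.sample(b, size))
    return total / repeats


rng = random.Random(0)
# Simulated lists whose true means differ, so the null is false.
list_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
list_b = [rng.gauss(0.3, 1.0) for _ in range(5000)]

avg_p = mean_subsample_pvalue(list_a, list_b, size=100, repeats=200, rng=rng)
full_p = z_test_pvalue(list_a, list_b)
print(f"average of subsample p-values: {avg_p:.3g}")
print(f"p-value on the full lists:     {full_p:.3g}")
```

In this toy setting the stand-in test can be run on the full lists, which is exactly what the real black box cannot do. The averaged value reflects only the evidence in 100 items per list, so it should come out far larger than the full-data p-value, and raising `repeats` does not change that.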

Instead I'd suggest using Fisher's combined probability test to combine the p-values. This assumes the p-values come from independent tests, so you want to sample without replacement so that each list value only occurs in one sample. Equivalently, randomly order both lists then divide them up into suitable-sized chunks to feed into your black box.
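The chunk-and-combine procedure could be sketched as follows. This is a minimal illustration, not a reference implementation: `black_box` is a hypothetical callable that takes two chunks and returns a p-value, and `disjoint_chunks`, `fisher_combine`, and `combined_pvalue` are names invented here. Fisher's statistic is X = -2 Σ ln(p_i), which under the null follows a chi-square distribution with 2k degrees of freedom for k independent p-values; for even degrees of freedom the tail probability has a closed form, so only the standard library is needed.

```python
import math
import random


def fisher_combine(pvalues):
    """Combine independent p-values with Fisher's method."""
    k = len(pvalues)
    if k == 0:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in pvalues):
        raise ValueError("p-values must lie in (0, 1]")
    half = -sum(math.log(p) for p in pvalues)  # X / 2
    # Chi-square survival function for even df = 2k:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


def disjoint_chunks(items, chunk_size, rng):
    """Shuffle a copy of `items` and cut it into equal-sized chunks,
    so each item lands in at most one chunk (no replacement)."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_full = len(shuffled) // chunk_size  # drop any partial last chunk
    return [shuffled[i * chunk_size:(i + 1) * chunk_size]
            for i in range(n_full)]


def combined_pvalue(list_a, list_b, black_box, chunk_size, seed=0):
    """Run the (hypothetical) black box on paired disjoint chunks
    and combine the resulting p-values with Fisher's method."""
    rng = random.Random(seed)
    chunks_a = disjoint_chunks(list_a, chunk_size, rng)
    chunks_b = disjoint_chunks(list_b, chunk_size, rng)
    pvalues = [black_box(a, b) for a, b in zip(chunks_a, chunks_b)]
    return fisher_combine(pvalues)
```

A call would look like `combined_pvalue(list_a, list_b, black_box, chunk_size=1000)`, with `chunk_size` set to the largest size the black box can handle. If SciPy is available, `scipy.stats.combine_pvalues(pvalues, method='fisher')` should give the same combined value. The closed-form sum is fine for a modest number of chunks; for very many chunks a log-space computation would be more robust.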
