I've been running some simulations on genotype effects on phenotypes. I'm representing genotypes by 0, 1 or 2 (the number of copies of the minor allele) and phenotypes as 0 or 1. Thus my contingency table is 2x3. I then run R's built-in Fisher Exact Test.

The problem is that, whether I simulate the null or the alternative hypothesis, the histogram of p-values has a spike at p = 1. I suspect the spike is caused by some of the minor alleles being too rare; when this happens, the two rows of the contingency table may end up nearly identical. For example, if my contingency table is

499 1 0

499 1 0

the p-value will be reported as 1, even though in reality the two rows come from very different distributions.

This is understandable; but my question is: Is R too eager to report p-value = 1? For example, R also reports a p-value of 1 for the contingency table

499 1 0

498 2 0

which exacerbates the spike in the histogram. Is this reasonable?

Thanks

J.D.

1 Answer


Fisher's exact test is doing what it is designed to do in your final example.

Fisher's exact test conditions on the margins of the table. You have $1000$ cases, and the test takes as given that there are $500$ in each row, with $997$ in the first column, $3$ in the second column and $0$ in the third column. With those totals fixed, there are only four possible tables:

500 0 0     499 1 0     498 2 0     497 3 0   
497 3 0     498 2 0     499 1 0     500 0 0 

with probabilities $\frac{{997 \choose 500}{3\choose 0}{0\choose 0}}{{1000 \choose 500}}$, $\frac{{997 \choose 499}{3\choose 1}{0\choose 0}}{{1000 \choose 500}}$, $\frac{{997 \choose 498}{3\choose 2}{0\choose 0}}{{1000 \choose 500}}$, $\frac{{997 \choose 497}{3\choose 3}{0\choose 0}}{{1000 \choose 500}}$, i.e.

 0.1246      0.3754      0.3754      0.1246
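For anyone who wants to check these numbers, here is a minimal sketch in Python using only the standard library. The names `col_totals`, `row_total` and `table_probability` are made up for illustration; the formula is the hypergeometric one written out above.

```python
from math import comb

# Fixed margins from the example: column totals and the first row's total.
col_totals = (997, 3, 0)
row_total = 500
n = sum(col_totals)  # 1000 cases in all

def table_probability(first_row):
    """Hypergeometric probability of a given first row with the margins fixed."""
    ways = 1
    for total, count in zip(col_totals, first_row):
        ways *= comb(total, count)
    return ways / comb(n, row_total)

# The second row is determined by the first, so listing first rows is enough.
first_rows = [(500, 0, 0), (499, 1, 0), (498, 2, 0), (497, 3, 0)]
probs = [table_probability(row) for row in first_rows]

for row, p in zip(first_rows, probs):
    print(row, round(p, 4))
```

The four values should match the row above and sum to $1$ up to floating-point rounding.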

The first and fourth tables are the least likely and (equally) the most extreme, so the $p$-value for each is their combined probability, $0.2492$.

The second and third tables are the most likely (equally so), so no table is less probable than they are. The $p$-value for each is therefore the sum over all four tables, which is $1$.
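The rule used here (add up the probabilities of every table that is no more likely than the observed one) can be sketched in Python as follows. This is an illustration under stated assumptions, not R's actual code: the function names are hypothetical, and the small relative tolerance `rel_tol` is an assumption added so that tables with equal probability are not separated by floating-point noise.

```python
from math import comb

def first_row_probs():
    """Probabilities of the four tables, indexed by k, the number of
    second-column cases that land in the first row (k = 0, 1, 2, 3)."""
    denom = comb(1000, 500)
    return [comb(997, 500 - k) * comb(3, k) / denom for k in range(4)]

def exact_p_value(observed_k, probs, rel_tol=1e-7):
    """Sum the probabilities of all tables no more likely than the observed one.

    rel_tol is an assumed guard against floating-point ties.
    """
    p_obs = probs[observed_k]
    total = sum(p for p in probs if p <= p_obs * (1 + rel_tol))
    return min(1.0, total)

probs = first_row_probs()
p_values = [exact_p_value(k, probs) for k in range(4)]
for k, p in enumerate(p_values):
    print(k, round(p, 4))
```

Here `k = 1` corresponds to the table with rows `499 1 0` and `498 2 0` from the question. It is one of the two most likely tables, so every table is counted and the $p$-value comes out as $1$.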

Henry
  • Thanks for the reply. That clarifies things. Just one clarification: How did you get the probabilities of each of the four contingency tables? Clearly they don't sum to 1. – J.D. Apr 08 '20 at 01:46
  • That is because I took a shortcut and made a mistake. I have corrected this using the hypergeometric probabilities. – Henry Apr 08 '20 at 07:57