
We are working on a bioinformatics problem. We have a collection $U$ of k-mers (short substrings of length $k$) taken from some set of sequences, and two different experiments. In each experiment only a subset of the k-mers is enriched: $A \subseteq U$ in the first experiment and $B \subseteq U$ in the second. We want to show that the overlap $A \cap B$ is statistically significant.

We were trying to use Fisher's exact test (FET) on a $2 \times 2$ contingency table with entries $|A \cap B|, |A \setminus B|, |B \setminus A|, |U \setminus (A \cup B)|$. But it seems that this FET has the null hypothesis: “the probability of a k-mer being enriched is the same in the two experiments”.

However, for our purpose that statement should be the alternative hypothesis, not the null.

Is there a different way to use FET for our purpose?
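
For concreteness, the table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `contingency_table` and `fisher_one_sided_greater` and the toy sets are hypothetical, and the one-sided "greater" alternative (overlap larger than expected with the margins $|A|$ and $|B|$ held fixed) is an assumed reading of "significant overlap", not something established in the question. The p-value is the upper tail of the hypergeometric distribution, computed with the standard library only.

```python
from math import comb


def contingency_table(U, A, B):
    """Return (both, only_a, only_b, neither) for subsets A, B of U."""
    both = len(A & B)
    only_a = len(A - B)
    only_b = len(B - A)
    neither = len(U) - both - only_a - only_b
    return both, only_a, only_b, neither


def fisher_one_sided_greater(both, only_a, only_b, neither):
    """P(overlap >= observed) when |A| and |B| are fixed (hypergeometric tail)."""
    n_total = both + only_a + only_b + neither
    n_a = both + only_a          # |A|
    n_b = both + only_b          # |B|
    max_overlap = min(n_a, n_b)  # the overlap cannot exceed either set
    tail = sum(
        comb(n_a, x) * comb(n_total - n_a, n_b - x)
        for x in range(both, max_overlap + 1)
    )
    return tail / comb(n_total, n_b)


# Toy data: integers stand in for k-mers (hypothetical example).
U = set(range(20))
A = set(range(0, 8))    # enriched in experiment 1
B = set(range(4, 10))   # enriched in experiment 2

table = contingency_table(U, A, B)
p_value = fisher_one_sided_greater(*table)
print(table, p_value)   # (4, 4, 2, 10) and 7/51, about 0.137
```

With these toy sets the overlap of 4 is above the count expected under independence (8 · 6 / 20 = 2.4), but the tail probability of about 0.137 would not be called significant at the usual 0.05 level.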

  • Not a categorical data expert, so this is in a comment. You've hit a core issue in null-hypothesis significance testing. I think you could get at what you want by combining a permutation test with a two one-sided test approach. You have a two-part null: Determine a difference in probabilities that would be substantively meaningful, and the null is that the population difference is greater than that in either direction. Then use a permutation test and use the resulting distribution to establish whether the null(s) can be rejected. – Patrick Malone May 29 '18 at 16:45
  • Thank you @PatrickMalone. I am quite new to statistical significance testing. Could you please point me to some examples where this approach of combining a permutation test with two one-sided tests has been applied? – Soumitra May 29 '18 at 17:16
  • If I get you right, you must have a 2x2 frequency table with sides A (1 vs 0) and B (1 vs 0). Cell (1,1) collects the k-mers which belong to both A and B. You want to test that the frequency f in (1,1) is bigger than expected there under randomness (that would be f_in_row1*f_in_col1 / f_total). So, you test that the residual in (1,1) is significantly positive. But in your situation of a 2x2 table, an increase in (1,1) automatically means a same-size increase in (0,0) and the same decrease in (1,0) and (0,1), because the marginal totals are fixed. – ttnphns May 29 '18 at 18:56
  • (cont.) Therefore significance testing of cell (1,1) is equivalent to testing any of the 4 cells, i.e. to testing the whole table for the significance of association/independence. It could be a chi-square test, but you preferred Fisher's exact test. That is all right. I don't see why you are in doubt. – ttnphns May 29 '18 at 18:57
  • Sorry, I'm just speculating here. – Patrick Malone May 29 '18 at 19:09
  • @ttnphns I think the concern is that Soumitra wants to endorse the hypothesis of no difference. – Patrick Malone May 29 '18 at 19:12
  • Thanks @PatrickMalone. I am not able to see the connection between the alternative hypothesis of FET that “the probability of a k-mer being enriched is different in the two experiments” and our alternative hypothesis that "the overlap is better than expected under randomness". That is, how are the two hypotheses equivalent (does one imply the other)? – Soumitra May 29 '18 at 19:21
  • I may have misunderstood your problem. I thought your alternative hypothesis was that "the overlap is as expected under randomness." If so, I apologize for the diversion. – Patrick Malone May 29 '18 at 20:43
  • Thanks @PatrickMalone. However, I still do not understand how our alternative hypothesis matches with that of FET. Do you see what ttnphns suggests? – Soumitra May 29 '18 at 21:15
  • Thank you @ttnphns for giving me some direction. However, I am not able to see how you linked my question to residuals and then to Fisher's exact test. In particular, I did not understand the following two steps in your argument. 1) Why would the expected frequency in (1,1) be f_in_row1*f_in_col1 / f_total? 2) How is testing the significance of a single cell equivalent to testing the significance of association/independence? – Soumitra May 29 '18 at 22:01
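
The two steps asked about in the last comment can be checked numerically. The sketch below is hypothetical (the function names `expected_counts` and `residuals` and the example counts are illustrative, not taken from the thread). It relies on the standard fact that if membership in A and membership in B are independent, the expected count in cell (1,1) is $N \cdot \frac{r_1}{N} \cdot \frac{c_1}{N} = r_1 c_1 / N$, where $r_1$ and $c_1$ are the row-1 and column-1 totals and $N$ is the grand total. Because each row and column total is fixed, the four residuals (observed minus expected) share one magnitude and differ only in sign.

```python
def expected_counts(both, only_a, only_b, neither):
    """Expected cell counts under independence, keeping the margins fixed."""
    n = both + only_a + only_b + neither
    row1, row0 = both + only_a, only_b + neither   # in A / not in A
    col1, col0 = both + only_b, only_a + neither   # in B / not in B
    return (row1 * col1 / n, row1 * col0 / n,
            row0 * col1 / n, row0 * col0 / n)


def residuals(both, only_a, only_b, neither):
    """Observed minus expected, in the order (1,1), (1,0), (0,1), (0,0)."""
    observed = (both, only_a, only_b, neither)
    expected = expected_counts(both, only_a, only_b, neither)
    return tuple(o - e for o, e in zip(observed, expected))


# Hypothetical counts: |A ∩ B| = 4, |A \ B| = 4, |B \ A| = 2, neither = 10.
print(expected_counts(4, 4, 2, 10))  # roughly (2.4, 5.6, 3.6, 8.4)
print(residuals(4, 4, 2, 10))        # roughly (+1.6, -1.6, -1.6, +1.6)
```

Since a 2x2 table with fixed margins has a single free cell, a statement about the excess in cell (1,1) is a statement about the whole table, which is the sense in which the single-cell test and the test of association coincide.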
