I'm hoping for some help with a problem that arose today.
I have two groups of measurements, drawn from a total of 200 blood samples from myeloid leukemia patients. 10 samples show mutations in a DNA-binding protein that facilitates epigenetic modification of chromatin (addition of a methyl group, which impedes gene transcription). The other 190 samples show no such mutations.
Methylation studies give me the locations of methyl groups on the genome for each sample, so I am testing whether the two groups differ in the number of methylation sites in the vicinity of the binding sites of the aforementioned protein. For each of the 10 mutant samples and 190 non-mutant samples, I have the distance from every protein binding site to the NEAREST methylation site, in either direction. I have structured these data as counts within a distance window of the protein binding site (for example, the number of binding sites whose nearest methylation site lies within 0-20 bp).
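To make the windowing concrete, here is a minimal Python sketch of how the per-sample counts could be built. The function name `count_within_window` and the distances are hypothetical, made up for illustration, and it assumes each sample is represented as a list of non-negative nearest-site distances in bp, one per binding site.

```python
from typing import Sequence


def count_within_window(distances: Sequence[float], window_bp: float) -> int:
    """Count binding sites whose nearest methylation site is within window_bp."""
    return sum(1 for d in distances if 0 <= d <= window_bp)


# Hypothetical nearest-site distances (bp) for ONE sample, one value per binding site.
sample_distances = [3, 12, 18, 25, 7, 40, 19, 33]

# One count per window width; repeating this for every sample gives the
# two groups of counts that are compared for each window.
counts_by_window = {w: count_within_window(sample_distances, w) for w in (10, 20, 30, 40)}
print(counts_by_window)
```

Note that the windows are nested, so the count for a wider window always includes the count for every narrower one.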
I find that there are indeed more methylation sites in the vicinity of binding sites in the mutant samples. A Welch's two-sample t-test (which does not assume equal variances between groups) bears this out for the 0-20 bp window. However, when I run the test using a 0-10 bp window, the difference is not significant. As the window widens beyond 20 bp (i.e., 0-30 bp, 0-40 bp), the result stays significant and the significance increases. I believe that this is due to the higher relative variance of the low counts in the shorter windows, but I may be wrong here.
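For reference, the comparison across windows can be sketched as below. This is only an illustration: the counts are invented (far fewer samples than the real 10 versus 190), the names `welch_t`, `mutant`, and `non_mutant` are hypothetical, and the function computes just the Welch t statistic and the Welch-Satterthwaite degrees of freedom using the standard library. In practice a library routine such as `scipy.stats.ttest_ind(..., equal_var=False)` would also return the p-value.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a, se2_b = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Invented per-sample counts, keyed by window width in bp.
mutant = {10: [2, 0, 3, 1], 20: [6, 4, 7, 5]}
non_mutant = {10: [1, 0, 2, 1, 0, 1], 20: [3, 2, 4, 3, 2, 3]}

for window in sorted(mutant):
    t, df = welch_t(mutant[window], non_mutant[window])
    print(f"0-{window} bp: t = {t:.2f}, df = {df:.1f}")
```

Because the denominator is built from the per-group variances, noisy low counts in a narrow window can shrink t even when the group means differ.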
Also, there is in general more methylation in the mutant samples than in the non-mutant samples, and this is not due exclusively to the above protein of interest.
So we have a problem, obviously. It's great that the statistic is significant for the 0-20 bp window, but the statistic is obviously sensitive to window size, so I can't trust it.
I am wondering whether anyone can suggest another test that I could use, or explain why the statistic is behaving in this way.