
I'm hoping for some help with a problem that arose today.

I have two groups of measurements, from a total of 200 blood samples from myeloid leukemia patients. 10 samples show mutations in a DNA-binding protein that facilitates epigenetic modification of chromatin (addition of a methyl group, which impedes gene transcription). 190 samples show no such mutations.

Methylation studies give me the locations of methyl groups on the genome for each sample, so I am testing whether the two groups differ in the number of methylation sites in the vicinity of the binding sites of the aforementioned protein. For each of the 10 mutant and 190 non-mutant samples, I have the distance from every protein binding site to the NEAREST methylation site, in either direction. I have structured these data as counts: the number of binding sites whose nearest methylation site falls within a given distance window (e.g., 0-20 bp).

I find that there are indeed more methylation sites in the vicinity of binding sites in the mutant samples. A Welch's two-sample t-test (unequal variances between groups) bears this out for the 0-20 bp window. However, when I run the test using a 0-10 bp window, the difference is not significant. The statistic also becomes more significant as the window widens (i.e., 0-30 bp, 0-40 bp). I believe this is due to the higher relative variance of the lower counts in the shorter windows, but I may be wrong here.
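To make the window scan concrete, here is a minimal sketch of the procedure described above, using only the Python standard library. The distance lists are made-up toy data (3 mutant and 4 non-mutant samples), and the helper names `window_count` and `welch_t` are hypothetical. It reports only the Welch t statistic and the Welch-Satterthwaite degrees of freedom, not a p-value.

```python
import math
from statistics import mean, variance


def window_count(distances, width):
    """Count binding sites whose nearest methylation site is within `width` bp."""
    return sum(1 for d in distances if d <= width)


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical toy data: one list of nearest-site distances (bp) per sample.
mutant = [
    [3, 8, 15, 18, 25, 60],
    [5, 12, 14, 19, 33, 70],
    [9, 11, 17, 22, 41, 90],
]
wild_type = [
    [4, 22, 35, 48, 80, 120],
    [7, 16, 38, 55, 95, 150],
    [2, 27, 31, 44, 66, 110],
    [13, 24, 36, 52, 71, 130],
]

# Repeat the same test for each window width to see how the statistic moves.
for width in (10, 20, 30, 40):
    m = [window_count(s, width) for s in mutant]
    w = [window_count(s, width) for s in wild_type]
    t, df = welch_t(m, w)
    print(f"0-{width} bp: t = {t:.2f}, df = {df:.1f}")
```

Running the same loop on the real samples gives the whole curve of the statistic against window width, which is easier to reason about than a single hand-picked window.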

Also, there is in general more methylation in the mutant samples than non-mutant, and this is not due exclusively to the above protein of interest.

So we have a problem, obviously. It's great that the statistic is significant for the 0-20 bp window, but the statistic is obviously sensitive to window size, so I can't trust it.

I am wondering if anyone can suggest another test that I could use, or explain why the statistic is behaving in this way.

Matt

1 Answer


Dependence of such measures on window sizes is common. You, however, have additional problems for this analysis.

"there is in general more methylation in the mutant samples than non-mutant, and this is not due exclusively to the above protein of interest."

This poses a problem for your analysis. If there are more methylated CpG sites overall in your mutant samples, then you would expect to have "more methylation sites in the vicinity of [protein] binding sites" just on that basis, even if methylated CpG sites were randomly distributed along the genome. Presumably what you want to show is that the methylated CpG sites are more concentrated near the protein binding sites than you would expect from the overall higher methylation level alone.
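As a rough illustration of that baseline effect, here is a minimal sketch of the count you would expect if methylated sites were scattered uniformly at random, so that the number of sites in a window of 2 × width bp is approximately Poisson. The function names, the genome length, and all sample numbers are hypothetical. The uniform-placement assumption is used only to show the idea and is not a realistic null model for CpG sites.

```python
import math


def expected_near_count(n_binding_sites, n_methylated, genome_length, width):
    """Expected number of binding sites with a methylated site within `width` bp
    on either side, ASSUMING methylated sites are placed uniformly at random."""
    rate = n_methylated / genome_length        # methylated sites per bp
    p_hit = 1.0 - math.exp(-2 * width * rate)  # P(at least one site within +/- width)
    return n_binding_sites * p_hit


def enrichment(observed, n_binding_sites, n_methylated, genome_length, width):
    """Observed count divided by the count expected under random placement."""
    expected = expected_near_count(n_binding_sites, n_methylated, genome_length, width)
    return observed / expected


GENOME_LENGTH = 3_000_000_000  # hypothetical genome length in bp

# Hypothetical samples: (observed near-count, binding sites, total methylated sites).
samples = {
    "mutant_1": (4200, 10_000, 30_000_000),
    "wild_type_1": (3000, 10_000, 20_000_000),
}

for name, (observed, n_bind, n_meth) in samples.items():
    ratio = enrichment(observed, n_bind, n_meth, GENOME_LENGTH, width=20)
    print(f"{name}: observed = {observed}, observed/expected = {ratio:.3f}")
```

In this toy case the mutant sample has a much larger raw count, yet the two observed-to-expected ratios come out nearly equal, because the extra counts are what the higher overall methylation level alone would produce. Comparing such ratios between groups, rather than raw counts, is one way to separate the two effects.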

Furthermore, as you know, CpG sites are not evenly distributed along the genome. They typically occur in clusters, CpG islands "with at least 200 bp, a GC percentage greater than 50%, and an observed-to-expected CpG ratio greater than 60%." So it's not clear that looking for the nearest methylation site, or correcting for overall CpG methylation levels, will be enough to convince your audience that you have found something interesting.

There is a fair amount of literature on the analysis of CpG islands and methylation, although this is beyond my expertise. This review by Tahir et al., J Biosci (2019) 44:143, discusses many approaches to quantitative analysis of CpG islands susceptible to methylation, including window-based approaches that seem similar to yours. I suspect that the Bioconductor project has well-vetted tools for this type of analysis.

EdM