I am trying to find overlaps between two sets of genomic ranges. I found a post on Biostars, but its approach does not take the chromosome into account. For example:

library(IRanges)
df1 = data.frame(chr = c("chr1", "chr12"), start = c(10000, 10000), end = c(20000, 20000))
df2 = data.frame(chr = rep("chr1", 4), posn = c(100, 12000, 15000, 250000), x = rep(1, 4), y = rep(2, 4), z = rep(3, 4))

df1.ir = IRanges(start = df1$start, end = df1$end, names = df1$chr)
df2.ir = IRanges(start = df2$posn, end = df2$posn, names = df2$chr) 

df1.it = NCList(df1.ir)

overlap_it = findOverlaps(df2.ir, df1.it)
print(overlap_it) 

This reports overlaps with the ranges on both chr1 and chr12, even though every position in df2 is on chr1. Is there a way to keep the chromosome information intact?

1 Answer

This follows up on a question asked on Biostars. Here is my solution, along with some suggestions for working with genomic ranges.

library(GenomicRanges)
df1 = data.frame(chr = c("chr1",  "chr12"), start = c(10000,  10000), end = c(20000, 20000))
df2 = data.frame(chr = rep("chr1", 4), posn = c(100, 12000, 15000, 250000), x = rep(1, 4), y = rep(2, 4), z = rep(3, 4))

# IRanges only: the names are labels and are ignored by findOverlaps()
df1.ir = IRanges(start = df1$start, end = df1$end, names = df1$chr)
df2.ir = IRanges(start = df2$posn, end = df2$posn, names = df2$chr)

df1.it = NCList(df1.ir)

overlap_it = findOverlaps(df2.ir, df1.it)
print(overlap_it)

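# Build GRanges objects so that the chromosome (seqnames) is part of each range
df1.gr = GRanges(seqnames = df1$chr, ranges = IRanges(start = df1$start, end = df1$end))
df2.gr = GRanges(seqnames = df2$chr, ranges = IRanges(start = df2$posn, end = df2$posn))

# Pre-process the subject into a genomic index ("git") for faster searching;
# GNCList is a nested containment list, used here in the role of an interval tree
df1.git = GNCList(df1.gr)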
t1 = Sys.time()
overlap_git = findOverlaps(df2.gr, df1.git)
t2 = Sys.time()
sub1 = difftime(t2, t1, units = "auto")
print(sub1)
print(overlap_git)

# For comparison: intersect() returns the overlapping regions, not index pairs
t1 = Sys.time()
overlap_intersect = intersect(df1.gr, df2.gr)
t2 = Sys.time()
print(overlap_intersect)
sub2 = difftime(t2, t1, units = "auto")
print(sub2)
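
The hits returned by findOverlaps are pairs of indices, so they usually need to be mapped back to the original data frames to be useful. A minimal sketch, reusing the toy data from the question; the names hits, matched, region_start and region_end are illustrative and not part of the original code:

```r
library(GenomicRanges)

# Same toy data as in the question
df1 <- data.frame(chr = c("chr1", "chr12"), start = c(10000, 10000), end = c(20000, 20000))
df2 <- data.frame(chr = rep("chr1", 4), posn = c(100, 12000, 15000, 250000),
                  x = rep(1, 4), y = rep(2, 4), z = rep(3, 4))

# The chromosome goes into seqnames, so it takes part in the overlap test
df1.gr <- GRanges(seqnames = df1$chr, ranges = IRanges(start = df1$start, end = df1$end))
df2.gr <- GRanges(seqnames = df2$chr, ranges = IRanges(start = df2$posn, width = 1))

hits <- findOverlaps(df2.gr, df1.gr)

# queryHits() indexes rows of df2, subjectHits() indexes rows of df1
matched <- cbind(df2[queryHits(hits), ],
                 region_start = df1$start[subjectHits(hits)],
                 region_end   = df1$end[subjectHits(hits)])
print(matched)
```

Only the positions 12000 and 15000 are kept, each paired with the chr1 region; the chr12 region never matches because no position in df2 lies on chr12.
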
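
  • Be aware of the difference between IRanges and GRanges: only GRanges carries the chromosome context. In this example, findOverlaps on the IRanges objects reports hits against both the chr1 and chr12 ranges, whereas on the GRanges objects it reports hits against chr1 only.
  • In my timing, searching against the pre-built GNCList index was much faster (almost 10x) than calling intersect. Note that the two are not equivalent: findOverlaps returns pairs of query and subject indices, while intersect returns the overlapping regions themselves.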
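
A single Sys.time() difference on four positions is dominated by noise, so the speed comparison is better repeated on larger input. The following is only a sketch under stated assumptions: the data are simulated, and the sizes, seed, coordinate range and variable names (regions, pts, time_indexed, time_intersect) are arbitrary choices for illustration. No timings are quoted here because they depend on the machine and package versions.

```r
library(GenomicRanges)

set.seed(1)                        # arbitrary seed, for reproducibility only
n_regions <- 1e5                   # hypothetical sizes, chosen for illustration
n_points  <- 1e5
chroms    <- paste0("chr", 1:22)

# Simulated 1 kb regions and single-base positions on random chromosomes
regions <- GRanges(seqnames = sample(chroms, n_regions, replace = TRUE),
                   ranges   = IRanges(start = sample.int(1e8, n_regions), width = 1000))
pts     <- GRanges(seqnames = sample(chroms, n_points, replace = TRUE),
                   ranges   = IRanges(start = sample.int(1e8, n_points), width = 1))

# Build the index once; it can then be reused for every query
regions.git <- GNCList(regions)

# system.time() reports user, system and elapsed time for each expression
time_indexed   <- system.time(hits_indexed    <- findOverlaps(pts, regions.git))
time_intersect <- system.time(overlap_regions <- intersect(regions, pts))

print(time_indexed)
print(time_intersect)
```

Keep in mind that hits_indexed holds index pairs while overlap_regions holds merged ranges, so this compares two different operations, as in the timing above.
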