4

I am trying to run some frequency based stats to identify selective sweep in my study system using Rpackage "PopGenome".

To proceed with it I have split my genome data (whole genome sequencing) into 5MB chunks

cregions <- character(31)
cregions[1] <- "1-5000000"
cregions[2] <- "5000001-10000000"
cregions[3] <- "10000001-15000000"
cregions[4] <- "15000001-20000000"
cregions[5] <- "20000001-25000000"
cregions[6] <- "25000001-30000000"
cregions[7] <- "30000001-35000000"
....

I have a bunch of genes and I want to know each of my genes falls within the range of which intervals in the above list.

these are my genes:

chrom start stop gene_id
Chr01 107310787 107323633 gene2748
Chr01 113789595 113794851 gene2894
Chr01 159041389 159068535 gene4806
Chr02 2742629 2758060 gene4846 

I'd like to get an out put like this

gene2748  window_2
gene2894  window_3
...

Any idea how I can do it in R?

Daniel Standage
  • 5,080
  • 15
  • 50
Anna1364
  • 516
  • 2
  • 8

1 Answers1

8

The GenomicRanges package is a great place to start, something like this:

library(GenomicRanges)

# make a GenomicRanges set corresponding to the genes
genes.gr = makeGRangesFromDataFrame(df = genes, start.field = "start", end.field = "stop", seqnames.field = "chrom")

# turn your windows into a data frame and make a GenomicRanges set from that
windows = ...
windows.gr = makeGRangesFromDataFrame(df = windows ...

# find overlaps between the two sets
olaps = findOverlaps(genes.gr, windows.gr)

# output relevant fields
cat(paste(genes[queryHits(olaps),"gene_id"], windows[subjectHits(olaps),"window_id"]))
Jay Moore
  • 1,012
  • 7
  • 9