How can I read multiple different regions from a BAM file in R?

Question

I’m trying to process BAM records (of a coordinate-sorted, indexed BAM file) successively in fixed-size, non-overlapping, sliding genome coordinate windows. Unfortunately I am unable to read records via scanBam in different windows (which parameter): the first call to scanBam succeeds but subsequent calls return zero-length results.

By contrast, if I don’t specify a coordinate window and instead use yieldSize when opening the BAM file, I can successfully read records in a loop:

library(Rsamtools)
bamfile = system.file('extdata', 'ex1.bam', package = 'Rsamtools')
process = function (reads) message(length(reads$seq))

bam = open(BamFile(bamfile, yieldSize = 1000L))
param = ScanBamParam(what = 'seq')

repeat {
    reads = scanBam(bam, param = param)[[1L]]
    if (length(reads$seq) == 0L) break
    process(reads)
}

… where process performs the read processing (unimportant for the example).

Now I’m trying to use the following code to work in genomic coordinate windows instead, and it fails as explained above:

windowsize = 1000L
bam = open(BamFile(bamfile))
chr = seqlevels(bam)[1L] # Only do one chromosome for now
chrlengths = seqlengths(bam)
result = lapply(seq(1L, chrlengths[chr] - windowsize, by = windowsize), function (pos) {
    which = setNames(RangesList(IRanges(pos, pos + windowsize)), chr)
    param = ScanBamParam(what = 'seq', which = which)
    reads = scanBam(bam, param = param)[[1L]]
    process(reads)
})

Conversely, it works if I close and re-open the BAM file in each iteration (i.e. before each scanBam call) — but that’s undesirable: it carries a substantial overhead (running on an empty BAM file with 16 chromosomes in its header takes several hours).

It's only returning data from the first iteration, not any subsequent iterations, unless I close the BAM file between scanBam calls.

Since the process function is nontrivial and written in R, I’m stuck with R for now. And there’s no reason why this shouldn’t work in R.

Am I overlooking something? Is this a bug in {Rsamtools}? Is it “by design”?

@gringer I’ve noticed that you’ve added the tag into the question title — why? On Stack Overflow there’s a broad consensus against redundant mentions of the tag in the question title, and in fact I frequently (several times a day, on average) edit to remove such mentions from titles on Stack Overflow (most useless taggings are more egregious, but removing “… in X” is a fair share of the edits I make). — Konrad Rudolph, Jul 08 '21 at 08:57
Following the same logic, why not remove BAM from the title? I don't think that "in R" is redundant here. In this case I considered that there were multiple valid Bioinformatics questions that could begin with "How can I read multiple different regions from a BAM file", even with the "R" tag. In this case it seems like you specifically want an R solution, which is why I added it in. — gringer, Jul 08 '21 at 10:53
Are you working under linux? and do you absolutely have to do it in R? — user324810, Oct 20 '21 at 07:49
@user324810 I mean, this question is three years old. And at any rate, regardless of whether somebody “absolutely” has to do this in R, it should work in R. But for posterity: yes, I absolutely had to do this in R since the process placeholder in my case was a piece of nontrivial logic developed in R over the span of months (in fact, the final implementation was in C++ anyway, the R code was my research prototype for faster iteration). I ended up processing fixed-sized chunks rather than scanning over genomic coordinate windows. — Konrad Rudolph, Oct 20 '21 at 08:12

How can I read multiple different regions from a BAM file in R?

0 Answers0