I’m trying to process BAM records (of a coordinate-sorted, indexed BAM file) successively in fixed-size, non-overlapping, sliding genome coordinate windows. Unfortunately I am unable to read records via scanBam in different windows (which parameter): the first call to scanBam succeeds but subsequent calls return zero-length results.
By contrast, if I don’t specify a coordinate window and instead use yieldSize when opening the BAM file, I can successfully read records in a loop:
library(Rsamtools)
bamfile = system.file('extdata', 'ex1.bam', package = 'Rsamtools')
process = function (reads) message(length(reads$seq))
bam = open(BamFile(bamfile, yieldSize = 1000L))
param = ScanBamParam(what = 'seq')
repeat {
reads = scanBam(bam, param = param)[[1L]]
if (length(reads$seq) == 0L) break
process(reads)
}
… where process performs the read processing (unimportant for the example).
Now I’m trying to use the following code to work in genomic coordinate windows instead, and it fails as explained above:
windowsize = 1000L
bam = open(BamFile(bamfile))
chr = seqlevels(bam)[1L] # Only do one chromosome for now
chrlengths = seqlengths(bam)
result = lapply(seq(1L, chrlengths[chr] - windowsize, by = windowsize), function (pos) {
which = setNames(RangesList(IRanges(pos, pos + windowsize)), chr)
param = ScanBamParam(what = 'seq', which = which)
reads = scanBam(bam, param = param)[[1L]]
process(reads)
})
Conversely, it works if I close and re-open the BAM file in each iteration (i.e. before each scanBam call) — but that’s undesirable: it carries a substantial overhead (running on an empty BAM file with 16 chromosomes in its header takes several hours).
It's only returning data from the first iteration, not any subsequent iterations, unless I close the BAM file between scanBam calls.
Since the process function is nontrivial and written in R, I’m stuck with R for now. And there’s no reason why this shouldn’t work in R.
Am I overlooking something? Is this a bug in {Rsamtools}? Is it “by design”?
processplaceholder in my case was a piece of nontrivial logic developed in R over the span of months (in fact, the final implementation was in C++ anyway, the R code was my research prototype for faster iteration). I ended up processing fixed-sized chunks rather than scanning over genomic coordinate windows. – Konrad Rudolph Oct 20 '21 at 08:12