Convert BAM to properly paired FASTQ files

Question

I thought I had figured this one out. But apparently not.

I need to convert a BAM file of paired-end alignments to two FASTQ files of paired reads to realign them, with a twist: I only want reads that fall within a defined region.

Currently I am using the following command (the region is an example; it also fails on different regions; at any rate, this is on hs37d5 but presumably the issue would exist for other references too):

samtools view -b -u -@ 48 alignments.bam 21:31,831,502-31,896,094 \
| samtools collate -n 128 -u -O -@ 48 - /tmp/bam-to-fastq- \
| samtools fastq -F 0x900 -@ 48 \
    -0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -

(Apart from the region filtering, this command is essentially the same that word of god tells us to use.)

The resulting files are not properly paired:

⟩⟩⟩ diff <(gunzip -c reads_R1.fastq.gz | sed -n -e 1~4p -e 10q) \
1 ⟩   <(gunzip -c reads_R2.fastq.gz | sed -n -e 1~4p -e 10q)
1,2d0
< @A00346:17:H52H7DSXX:3:2142:32316:24518
< @A00346:17:H52H7DSXX:3:1153:6705:26412
3a2,3
> @A00346:17:H52H7DSXX:3:1173:9118:3129
> @A00346:17:H52H7DSXX:3:1109:29414:11757

As far as I can see from performing a spot check, these are all reads which are marked as paired in the BAM file, but for which only one mate falls within the filtered region.

And running bwa mem on the files gives me the error:

[mem_sam_pe] paired reads have different names: […]

How can I ensure that only properly paired reads are included in the FASTQ files?

I'm guessing this is just that some reads fall in the target region but their pairs fall outside it. I don't know how to do this (hence a comment), but have you tried extracting only those reads that fall entirely within the target region? — terdon, Jul 09 '19 at 11:32
@terdon Right, that’s precisely the problem. However, I’d naively expect samtools fastq to do the right thing here, and actually output pairs of reads, or no read at all. Because resolving this efficiently separately is pretty hard. — Konrad Rudolph, Jul 09 '19 at 12:40
The samtools devs might not agree that this is a bug, but the use case you describe is certainly sufficiently common that they should consider implementing it as an option. There's a strong argument that it should really be the default behavior, but that would probably require a major version bump, which raises another set of practical considerations. — Daniel Standage, Jul 09 '19 at 13:28
That said, I can see some complications with its implementation. We typically see paired reads mapped within a few hundred base pairs, but how will it affect performance if we have very large insert libraries or pairs that are discordantly mapped? In the absence of a data structure that explicitly tracks the location of a read's pair, there are some edge cases that wouldn't be straightforward to handle. From this perspective, I'm not surprised the use case you describe doesn't seem to be supported out-of-the-box. — Daniel Standage, Jul 09 '19 at 13:32
@DanielStandage samtools collate should take care of this issue: If both reads of a pair are present, they will be adjacent; so no fancy data structure is necessary. In principle this should even be possible using a simple additional pass using sed but I’m currently failing at that. — Konrad Rudolph, Jul 09 '19 at 13:41
@KonradRudolph The range-based query (presumably?) requires position-sorted reads, where pairs are separated. Yes, the collate operation should bring the pairs together, but only if both pairs were in the query range. My point was that when reads are sorted by position it's not always straightforward to extend the search to grab pairs of all reads in the query region. — Daniel Standage, Jul 09 '19 at 13:47
@KonradRudolph Collating before executing the range query would address the issue, but I don't think it's possible to execute a range query on alignments that aren't sorted by genomic position. — Daniel Standage, Jul 09 '19 at 13:49
@DanielStandage I don’t understand your point then: Filtering of paired reads only needs to happen in the last step, i.e. samtools fastq (or immediately before). That’s why the “current” order of operations (filtering for region, name collation, filtering by proper pairing TBD) is fine, and there’s no issue requiring fancy data structures. — Konrad Rudolph, Jul 09 '19 at 14:06
@KonradRudolph I think I misunderstood. I was talking about the difficulty of implementing a strategy that makes sure that if either end of the read is in the query region, then both ends are included in the output. It's also possible to maintain proper pairing by discarding any read whose pair doesn't fall within the query region. After re-reading your post it looks like this is what you were suggesting. And indeed should be easier to implement. — Daniel Standage, Jul 09 '19 at 14:16

score 6 · Accepted Answer · answered Jul 09 '19 at 14:07

As noted in the comments, the problem is “some reads fall in the target region but their pairs fall outside it”, leading to non-trivial numbers of singleton reads coming out of samtools collate.

… | samtools fastq -F 0x900 -@ 48 \
    -0 /dev/null -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz -

Your samtools fastq command is not doing anything to siphon off these singleton reads. You should add -s /dev/null to get rid of them.

See the discussion in samtools/samtools#874 (especially the part starting here). I have not looked at how the recent changes from jkbonfield, which will land in an upcoming samtools version (after 1.9), might affect this.

Oooh, that explains why I previously thought this worked! My script used to include -s /dev/null but then I removed it because I thought it had no effect. The documentation of these different outputs is somewhat confusing. — Konrad Rudolph, Jul 09 '19 at 14:14

Convert BAM to properly paired FASTQ files

1 Answers1