6

I am working with bam files and I have to check if reads of a specific position or their mates are soft clipped. So, I am looking for a fast way to extract the read pairs from a bam file in python. So far, I use pysam and fetch reads of a given position. From those reads, I have access to the CIGAR string and can check them for soft clipping. My problem is that I cannot access the CIGAR of the read mates. Is there a way to access information of read mates with pysam or another tool, especially CIGAR strings, without iterating over the whole .bam file?

Unlike in this question Extracting the CIGAR string from a BAM via Python? I am looking for the CIGAR of both reads of a pair.

Mereven
  • 61
  • 2

1 Answers1

5

I think that the easiest way is to work with query-name sorted groups of reads. In that case, mates will be adjacent in the sort and you can use that to extract the paired CIGARs.

If you are depending on position sorting to extract the alignments of interest efficiently, you can maybe do the following:

  1. extract alignments in the region of interest plus a lot of padding to ensure that you are also getting the mates (this will obviously only work with standard short-insert shotgun libraries).

  2. Query-name sort the list iterator using built-in sort() or sorted() with a key function that returns the query name.

  3. Iterate over the pairs using adjacency in the iterator to extract paired CIGARs. Discard unpaired reads (first reads whose query name does not match supposed mates).

Code would look something like this (I have not run or tested this in any way!):

import pysam

def get_query_key(read): """read is pysam.AlignedSegment""" return read.query_name

def get_paired_cigars(bam, chr, start, end, padding=1000): """bam is pysam.AlignmentFile""" reads = list(bam.fetch(bam, chr, start-padding, end+padding)) # may fail at chr ends reads.sort(key=get_query_key) first_cigar = None mate_cigar = None cigars = list() qn = "" for read in reads: if first_cigar is None: first_cigar = read.cigarstring qn = read.query_name elif mate_cigar is None: if read.query_name != qn: first_cigar = read.cigarstring qn = read.query_name else: mate_cigar = read.cigarstring if first_cigar is not None and mate_cigar is not None: cigars.append((first_cigar, mate_cigar)) first_cigar = None mate_cigar = None return cigars

I agree that I would like this to be simpler! Unfortunately BAM files are not terribly well-designed for using pairing information at that level of detail IMO, I have always had to use something like this. I would be interested to see a straightforward alternative to this approach from someone with more experience.

Maximilian Press
  • 3,989
  • 7
  • 23