7

Is there a standard method in Python to extract a CIGAR string from the BAM?

There are great libraries which parse the CIGAR, e.g. https://pypi.python.org/pypi/cigar/0.1

>>> c = Cigar('10M20S10M')
>>> c.mask_left(10).cigar
'10S20S10M'
>>> c.mask_left(9).cigar
'9S1M20S10M'
>>> Cigar('10S').mask_left(10).cigar
'10S'
>>> Cigar('10H').mask_left(10).cigar
'10H'
>>> Cigar('10H').mask_left(11).cigar
'10H'
>>> Cigar('10H').mask_left(9).cigar
'10H'

It looks like pysam already parses the cigar string, if I'm not mistaken:

import pysam

bam = 'myfile.bam'
bamfile = pysam.AlignmentFile(bam, 'rb')

for read in bamfile:
    if not read.is_unmapped:
        cigar = read.cigar
        print(cigar)

This outputs lists of tuples of the parse output:

[(5, 61), (0, 30), (5, 198)]
[(4, 11), (0, 30), (4, 248)]
[(4, 11), (0, 30), (4, 248)]
....

How could I simply output the CIGAR from read? I would prefer to create a column in a pandas DataFrame

ShanZhengYang
  • 1,691
  • 1
  • 14
  • 20

2 Answers2

4

If you really do just want the cigar string then it's read.cigarstring. However, I'm not sure what you're trying to gain with the cigar package from Brent. Unless you want to get the string with the masking changed then the tuple you get from pysam is the same as what you get from cigar (with the exception of the numeric operations instead of character operations).

Devon Ryan
  • 19,602
  • 2
  • 29
  • 60
  • Oh, the API on the website had me confused with cigar string. Perhaps I could use the parsed cigar from pysam, but I'm confused: How do I parse the list of tuples in order count the total number of insertions, deletions, matches (and possible other information) from this? I apologise if this is documented, but the pysam manual is confusing imho. – ShanZhengYang Mar 09 '18 at 19:42
  • Should I read the above, e.g. the first line [(5, 61), (0, 30), (5, 198)] as 61 hard-clippings, 30 matches, and 198 hard clippings? – ShanZhengYang Mar 09 '18 at 19:52
  • Yes, that is the correct interpretation of that list of tuples, which is documented in the API – Devon Ryan Mar 09 '18 at 20:19
  • I see; this question is largely my misunderstanding of the pysam website. I'm happy to delete this question, unless it's helpful for the community. – ShanZhengYang Mar 10 '18 at 18:06
0
# pip install pyranges
# pip install bamread

# wget http://hgdownload-test.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeRikenCage/releaseLatest/wgEncodeRikenCageGm12878CellPapAlnRep1.bam # (half a gig)    
# samtools index wgEncodeRikenCageGm12878CellPapAlnRep1.bam

f = "wgEncodeRikenCageGm12878CellPapAlnRep1.bam"

import pyranges as pr
df = pr.read_bam(f, sparse=False, as_df=True)

df[["Chromosome", "Start", "End", "Strand", "Cigar"]].drop_duplicates("Cigar")
#          Chromosome  Start    End Strand             Cigar
# 0              chr1  10579  10606      -               27M
# 36             chr1  29334  29360      +           8M1I18M
# 37             chr1  29334  29360      +           7M1I19M
# 38             chr1  29334  29360      +           5M1I21M
# 44             chr1  29339  29365      +           4M1I22M
# ...             ...    ...    ...    ...               ...
# 19655298       chrM  14550  14575      +       16M5I3M3D3M
# 19660213       chrM  14747  14775      -       15M5D5M4I3M
# 19663987       chrM  15374  15399      -   2M1I2M2D4M3I15M
# 19676864       chrM  16234  16271      +  15M1I4M12D5M1I1M
# 19677003       chrM  16264  16291      +   14M1D3M2D1M3I6M
The Unfun Cat
  • 461
  • 3
  • 10