3

I would like to find out where (in genomic coordinates) large insertions are found within a given BAM file, file.bam.

(In terms of genomic coordinates, I just mean I would like to have a rough idea where to look at the BAM in IGV, chromosome and position.)

Based on the BAM CIGAR, I should be able to grep the BAM based on some cut-off of "large inserted" bases.

It's unclear to me how to best do this via samtools; one could parse the cigar string via samtools for insertions, but I don't understand how to translate this into the genomic position in the alignment.

This goes with various samtools wrappers, e.g. pysam in python:

import pysam

bamfile = pysam.Samfile(bamFile, "rb");

for read in bamfile:
    if not read.is_unmapped:   ## only get mapped reads
        cigarLine = read.cigar
        for cigarType, cigarLength in cigarLine:
            if cigarType == 1: ## this means it's an insertion...
                  ## somehow filter only large insertions, and then get coordinate information

I'm not sure how to do this in a non-clumsy manner. Any help appreciated.

Devon Ryan
  • 19,602
  • 2
  • 29
  • 60
ShanZhengYang
  • 1,691
  • 1
  • 14
  • 20

2 Answers2

2

You basically have the right framework already in your pysam code, which I took the liberty of tidying up a bit. You just need to add a bit:

for read in bamfile:
    if not read.is_unmapped:
        cigarLine = read.cigar
        interesting = False
        for cigarType, cigarLength in cigarLine:
            if cigarType == 1:
                if cigarLength > 100:  ## Arbitrary value
                    interesting = True
                    break
        if interesting:
            print("{}:{}-{}".format(read.reference_name, read.reference_start, read.reference_end))

Unless you have long reads, this will be close enough for IGV.

Devon Ryan
  • 19,602
  • 2
  • 29
  • 60
1

a one-liner using bioalcidaejdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html

$ java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.getCigar().getCigarElements().stream().anyMatch(C->C.getLength()>100 && (C.getOperator().equals(CigarOperator.N) || C.getOperator().equals(CigarOperator.D)))).forEach(R->println(R.getContig()+"\t"+R.getStart()+"\t"+R.getEnd()));' input.bam
Pierre
  • 1,536
  • 7
  • 11