3

I used BWA-MEM to alignment and I would like to gather the some informations like total % of match, mismatch, insert/delete etc. I am wondering if there is any existing tools that produces this information?

viz12
  • 51
  • 2

3 Answers3

2

Other options... samtools stats maybe gives you some of the stats you are after. Chances are you already have samtools so nothing additional to install. Also useful may be pysamstats, in particular the variation sub-command.

dariober
  • 699
  • 3
  • 6
0

I've written my own script, readstomper, which gives per-site statistics from mpileup results. Here's an example usage:

$ samtools mpileup -B -Q 0 -f circ-Nb-ec3-mtDNA.fasta LAST_OL_132394_vs_mtDNA.bam | \
    /bioinf/scripts/readstomper.pl > LAST_OLmpileupProp_132394_vs_mtDNA.csv

Assembly,Position,Coverage,ref,cR,pR,A,C,G,T,d,i,InsMode
Nb_mtDNA,11456,86,T,79.00,0.92,0.0000,0.0465,0.0000,0.0000,0.03,0.01,A;1.00
Nb_mtDNA,11457,86,A,81.00,0.94,0.0000,0.0000,0.0233,0.0000,0.03,0.02,TA;0.50
Nb_mtDNA,11458,86,G,79.00,0.92,0.0349,0.0233,0.0000,0.0116,0.01,0.00
Nb_mtDNA,11459,86,A,78.00,0.92,0.0000,0.0000,0.0235,0.0471,0.01,0.00
Nb_mtDNA,11460,86,T,5.00,0.06,0.0233,0.8140,0.0233,0.0000,0.08,0.00
Nb_mtDNA,11461,86,G,57.00,0.66,0.1512,0.1163,0.0000,0.0465,0.02,0.01,AATAACAACACGTAACCGA;1.00
Nb_mtDNA,11462,86,G,78.00,0.91,0.0349,0.0349,0.0000,0.0116,0.01,0.00
Nb_mtDNA,11463,86,G,66.00,0.77,0.0465,0.0698,0.0000,0.0116,0.10,0.01,GCAATA;1.00
Nb_mtDNA,11464,86,T,74.00,0.86,0.0000,0.0814,0.0233,0.0000,0.03,0.00

It defaults to reporting variants as proportions of the total coverage, but this can be changed to reporting the individual counts:

$ samtools mpileup -B -Q 0 -f mmus_chrM.fa mm2_1.2M_vs_Mmus_mtDNA.bam | \
    /bioinf/scripts/readstomper.pl --counts

Assembly,Position,Coverage,ref,skip,cR,pR,A,C,G,T,d,i,InsMode
chrM,1528,491,C,0,409,0.87,17,0,2,15,25,65,A;52
chrM,1529,492,A,0,366,0.75,0,2,2,9,112,12,A;5
chrM,1530,493,A,0,409,0.83,0,2,1,13,68,7
chrM,1531,495,A,0,463,0.94,0,1,10,0,21,8,G;5
chrM,1532,496,A,0,463,0.94,0,2,5,8,16,4,G;3
chrM,1533,495,G,0,459,0.93,3,1,0,2,30,4
chrM,1534,496,A,0,451,0.91,0,0,4,1,39,23,G;21
chrM,1535,503,G,0,391,0.78,4,1,0,4,103,9,A;6
chrM,1536,506,G,0,496,0.98,6,0,0,0,4,10,A;3
chrM,1537,506,G,0,488,0.96,8,1,0,4,5,5,A;3
chrM,1538,506,A,0,483,0.95,0,4,3,4,12,12
chrM,1539,506,C,0,466,0.92,9,0,2,14,14,27,A;6
chrM,1540,506,A,0,469,0.93,0,0,17,6,14,11,T;3
chrM,1541,506,G,0,453,0.90,12,1,0,5,31,21,A;4
chrM,1542,505,C,0,421,0.83,6,0,2,14,62,8,CT;3
gringer
  • 14,012
  • 5
  • 23
  • 79
0

using BioAlcidaeJdk: http://lindenb.github.io/jvarkit/BioAlcidaeJdk.html , you can loop over the cigar strings of each mapped read and extract the number of inserted bases, clipped bases, etc...

java -jar dist/bioalcidaejdk.jar -e 'stream().filter(R->!R.getReadUnmappedFlag() && R.getCigar()!=null).flatMap(R->R.getCigar().getCigarElements().stream()).collect(Collectors.toMap(C->C.getOperator(),C->(long)C.getLength(),(L1,L2)->L1+L2)).forEach((O,N)->println(O.name()+"\t"+N));' toy.bam
I   15
H   11
S   1
D   1
N   14
P   2
M   191

to get the number of mismatches, you can preprocess the bam with http://lindenb.github.io/jvarkit/SamFixCigar.html

Pierre
  • 1,536
  • 7
  • 11