14

If there are soft clipped base pairs specified in the CIGAR string for a read in a SAM/BAM file, will these be used for variant calling in a samtools + bcftools workflow?

The GATK HaplotypeCaller, for example, has an explicit option --dontUseSoftClippedBases for whether to use soft clipped bases. The samtools documentation does not mention clipped bases.

mattm

1 Answer

15

No, samtools (and therefore bcftools) does not use soft-clipped bases. You can quickly confirm this by running either samtools depth or samtools mpileup on a region containing a soft-clipped alignment: the soft-clipped bases don't appear in the depth/pileup (both tools use the same underlying code, so it doesn't matter which you use).

If you're curious, samtools ignores soft-clipped bases because it works by building a per-base stack of the alignments covering each position. In the BAM format, alignments are sorted and assigned to bins according to their start/end positions, and those positions don't include soft-clipping. Consequently, when samtools builds the pileup, it never even sees the alignments that would overlap a given base only if their soft-clipped bases were counted.
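To make the start/end point concrete, here is a minimal sketch of how an alignment's reference span follows from its POS and CIGAR under the SAM specification, where only M, D, N, = and X consume reference bases. This is not samtools' actual code; the function names `reference_span` and `covers` are made up for illustration.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
# S (soft clip), H (hard clip), I (insertion) and P (padding) do not.
REF_CONSUMING = set("MDN=X")

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos, cigar):
    """Return the 1-based inclusive (start, end) reference interval of an alignment.

    pos is the SAM POS field, i.e. the position of the first aligned
    (not soft-clipped) base.
    """
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in REF_CONSUMING)
    return pos, pos + ref_len - 1


def covers(pos, cigar, site):
    """True if the aligned part of the read overlaps the reference position site."""
    start, end = reference_span(pos, cigar)
    return start <= site <= end


# A 50 bp read with 5 soft-clipped bases at its left end, aligned at POS 100.
print(reference_span(100, "5S45M"))  # (100, 144)
# The clipped bases would "sit" at 95-99, but the alignment does not cover them.
print(covers(100, "5S45M", 97))      # False
print(covers(100, "5S45M", 100))     # True
```

Because the span starts at POS and grows only with reference-consuming operations, a lookup for position 97 never returns this read, regardless of how many bases were soft-clipped.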

This raises the question of what GATK's HaplotypeCaller does differently. There, regions of the genome are essentially reassembled in a small de Bruijn graph, which allows soft-clipped bases around indels to be resolved, since the graph starts and ends a little way past each side of the indel. This is also why you don't need to do indel realignment with the HaplotypeCaller (it was needed with the old UnifiedGenotyper).
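As a rough illustration of the idea only (a toy, not GATK's implementation, which adds a reference path, pruning and haplotype scoring), a de Bruijn graph can be built from the full read sequences. Since the SEQ field of a SAM record includes soft-clipped bases, they contribute k-mers like any other bases. The function name `de_bruijn_edges` is hypothetical.

```python
from collections import defaultdict


def de_bruijn_edges(sequences, k):
    """Build a toy de Bruijn graph: nodes are (k-1)-mers, edges come from k-mers.

    Returns a dict mapping each (k-1)-mer prefix to the set of (k-1)-mer
    suffixes that follow it in at least one input sequence.
    """
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping reads; the whole read sequence is used, clipped or not.
reads = ["ACGTT", "CGTTA"]
graph = de_bruijn_edges(reads, k=3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

The graph is built from sequence overlap rather than from alignment coordinates, which is why bases the aligner clipped can still take part in the assembly.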

Edit: For more details regarding the HaplotypeCaller, see this nice page on GATK's website, which goes into much more detail than I did here.

Devon Ryan