
Currently I'm performing the whole analysis (pipeline from *.fastq to *.vcf) of 41 samples (targeted NGS). I rely on the GATK Best Practices, though with some modifications. I decided to use the following tools:

#mapping
bwa mem (alt-aware, i.e. mapping against the reference with alternate haplotypes)

#preprocessing
RevertSam (Picard)
MergeBamAlignment (Picard)
MarkDuplicates (Picard)
BaseRecalibrator (GATK)
ApplyBQSR (GATK)
AnalyzeCovariates (GATK)
SortSam (Picard)

#call, filter and annotate variants
HaplotypeCaller (GATK)
GenotypeGVCFs (GATK)
snpEff, SnpSift
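
For concreteness, here is a rough sketch of the kind of commands I mean for one sample (file names, the reference, the known-sites VCF and the snpEff database are placeholders, and the options are simplified rather than my literal command lines):

#mapping (bwa mem is alt-aware when a ref.fasta.alt file accompanies the index)
bwa mem -t 8 -R '@RG\tID:S1\tSM:S1\tPL:ILLUMINA\tLB:lib1' ref.fasta S1_R1.fastq.gz S1_R2.fastq.gz > S1_bwa.sam

#preprocessing
java -jar picard.jar RevertSam I=S1_input.bam O=S1_unmapped.bam    # unmapped BAM carrying read metadata; input depends on how the reads were delivered
java -jar picard.jar MergeBamAlignment ALIGNED_BAM=S1_bwa.sam UNMAPPED_BAM=S1_unmapped.bam OUTPUT=S1_merged.bam REFERENCE_SEQUENCE=ref.fasta
java -jar picard.jar MarkDuplicates I=S1_merged.bam O=S1_markdup.bam M=S1_dup_metrics.txt
gatk BaseRecalibrator -I S1_markdup.bam -R ref.fasta --known-sites known_sites.vcf.gz -O S1_recal.table
gatk ApplyBQSR -I S1_markdup.bam -R ref.fasta --bqsr-recal-file S1_recal.table -O S1_recal.bam
gatk AnalyzeCovariates -bqsr S1_recal.table -plots S1_recal_plots.pdf
java -jar picard.jar SortSam I=S1_recal.bam O=S1_sorted.bam SORT_ORDER=coordinate

#call, filter and annotate variants
gatk HaplotypeCaller -R ref.fasta -I S1_sorted.bam -O S1.g.vcf.gz -ERC GVCF
gatk GenotypeGVCFs -R ref.fasta -V S1.g.vcf.gz -O S1.vcf.gz
java -jar snpEff.jar GRCh38.86 S1.vcf.gz > S1.ann.vcf    # database name is just an example
java -jar SnpSift.jar filter "( QUAL >= 30 )" S1.ann.vcf > S1.filtered.vcf    # example filter expression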

Nevertheless, after marking duplicates with MarkDuplicates, about 99% of the reads were flagged as duplicates, and the coverage dropped from several hundred to literally five at most. So I decided to omit this step.
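
For what it's worth, the duplication rate can be checked quickly from the duplicate flags and from the MarkDuplicates metrics file; a rough sketch with placeholder file names:

# reads flagged as duplicates (SAM flag 0x400) among all reads
samtools flagstat S1_markdup.bam

# PERCENT_DUPLICATION column in the Picard MarkDuplicates metrics file
grep -A 1 '^LIBRARY' S1_dup_metrics.txt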

Have you ever encountered a similar problem? What is your opinion about this?

Best Regards, Adam

EDIT

The libraries were prepared using a custom Agilent HaloPlex target enrichment kit (HaloPlex Target Enrichment for Illumina), and sequencing was done on an Illumina HiSeq 2500. This is targeted NGS of all exons (only) of 90 genes.

Here is what the final BAM file looks like. This is the result without duplicate marking; otherwise there are only a few reads per exon.

[screenshot of coverage in the final BAM]

Adamm
  • Can you clarify what you mean by targeted NGS? What are you targeting/capturing and how were the sequencing libraries prepared? – conchoecia Feb 14 '19 at 06:32
  • I believe that with HaloPlex target capture you expect reads to have identical start and end coordinates, so IMHO in this case you should skip duplicate marking/removal. – Wouter De Coster Feb 14 '19 at 07:43
  • Wouter is right; HaloPlex is PCR-based, so consequently you expect high duplication. Do not deduplicate this data prior to analysis. –  Feb 14 '19 at 12:06
  • Thanks for the help! By the way, what do you think about GATK as a tool for such an analysis? For me it's nice, but a bit too complicated. – Adamm Feb 14 '19 at 12:39
  • Indeed, after MarkDuplicates the coverage was at most 4. It was hard to distinguish actual alterations from mismatched bases. – Adamm Feb 15 '19 at 05:55

2 Answers


If you're doing target capture, especially of a small region, it is entirely possible that there are only five unique reads. Target capture is molecularly finicky and sometimes the entire region is not captured.
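
One rough way to check this on the data is to compare mean depth over the targets with and without duplicate-flagged reads; a sketch with placeholder file names, where targets.bed holds the panel's target regions:

# mean depth over targets including duplicate-flagged reads
samtools depth -a -b targets.bed S1_markdup.bam | awk '{sum += $3; n++} END {if (n) print sum/n}'

# mean depth after dropping duplicate-flagged reads (flag 0x400)
samtools view -b -F 0x400 -o S1_dedup.bam S1_markdup.bam
samtools depth -a -b targets.bed S1_dedup.bam | awk '{sum += $3; n++} END {if (n) print sum/n}'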

conchoecia

Most duplicate-marking programs work by looking at the start and end locations of mapped read pairs. If two read pairs have identical start and end locations for both reads, they will be considered [PCR] duplicates of each other. After running them through a duplicate filter, you will only get one read pair for each unique pair of locations.

This is much more likely for targeted sequencing, because the length covered by the template is much smaller. It's also more likely that you actually do have lots of PCR duplicates for the same reason.
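
As a rough illustration (this approximates the duplicate key; real duplicate markers use unclipped 5' positions and read orientation), you can count how many properly-paired reads share the same chromosome, start and mate start, with a placeholder BAM name:

# for first-in-pair, properly-paired, primary, mapped reads, count identical (chrom, pos, mate pos) keys;
# with amplicon data such as HaloPlex, a handful of keys account for almost all read pairs
samtools view -f 0x42 -F 0x904 S1.bam | awk '{print $3, $4, $8}' | sort | uniq -c | sort -rn | head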

gringer