
Currently I'm performing the whole analysis (pipeline from *.fastq to *.vcf) of 41 samples (targeted NGS). I rely on the GATK Best Practices, though with some modifications. I decided to use the following tools:

#mapping
bwa mem (alt-aware, i.e. mapping against the reference with alternate haplotypes)

#preprocessing
RevertSam (Picard)
MergeBamAlignment (Picard)
MarkDuplicates (Picard)
BaseRecalibrator (GATK)
ApplyBQSR (GATK)
AnalyzeCovariates (GATK)
SortSam (Picard)

#call, filter and annotate variants
HaplotypeCaller (GATK)
GenotypeGVCFs (GATK)
snpEff, SnpSift
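
For concreteness, here is a rough sketch of the kind of commands I mean for one sample (file names, the reference, the known-sites VCF and the snpEff database are placeholders, and the options are simplified rather than my literal command lines):

#mapping (bwa mem is alt-aware when a ref.fasta.alt file accompanies the index)
bwa mem -t 8 -R '@RG\tID:S1\tSM:S1\tPL:ILLUMINA\tLB:lib1' ref.fasta S1_R1.fastq.gz S1_R2.fastq.gz > S1_bwa.sam

#preprocessing
java -jar picard.jar RevertSam I=S1_input.bam O=S1_unmapped.bam    # unmapped BAM carrying read metadata; input depends on how the reads were delivered
java -jar picard.jar MergeBamAlignment ALIGNED_BAM=S1_bwa.sam UNMAPPED_BAM=S1_unmapped.bam OUTPUT=S1_merged.bam REFERENCE_SEQUENCE=ref.fasta
java -jar picard.jar MarkDuplicates I=S1_merged.bam O=S1_markdup.bam M=S1_dup_metrics.txt
gatk BaseRecalibrator -I S1_markdup.bam -R ref.fasta --known-sites known_sites.vcf.gz -O S1_recal.table
gatk ApplyBQSR -I S1_markdup.bam -R ref.fasta --bqsr-recal-file S1_recal.table -O S1_recal.bam
gatk AnalyzeCovariates -bqsr S1_recal.table -plots S1_recal_plots.pdf
java -jar picard.jar SortSam I=S1_recal.bam O=S1_sorted.bam SORT_ORDER=coordinate

#call, filter and annotate variants
gatk HaplotypeCaller -R ref.fasta -I S1_sorted.bam -O S1.g.vcf.gz -ERC GVCF
gatk GenotypeGVCFs -R ref.fasta -V S1.g.vcf.gz -O S1.vcf.gz
java -jar snpEff.jar GRCh38.86 S1.vcf.gz > S1.ann.vcf    # database name is just an example
java -jar SnpSift.jar filter "( QUAL >= 30 )" S1.ann.vcf > S1.filtered.vcf    # example filter expression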

Nevertheless, after marking duplicates with MarkDuplicates, about 99% of the reads were flagged as duplicates, and the coverage dropped from several hundred to literally five at most. So I decided to omit this step.
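
For what it's worth, the duplication rate can be checked quickly from the duplicate flags and from the MarkDuplicates metrics file; a rough sketch with placeholder file names:

# reads flagged as duplicates (SAM flag 0x400) among all reads
samtools flagstat S1_markdup.bam

# PERCENT_DUPLICATION column in the Picard MarkDuplicates metrics file
grep -A 1 '^LIBRARY' S1_dup_metrics.txt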

Have you ever encountered a similar problem? What is your opinion about this?

Best Regards, Adam

EDIT

The libraries were prepared using a custom Agilent HaloPlex target enrichment kit (HaloPlex Target Enrichment for Illumina), and sequencing was done on an Illumina HiSeq 2500. This is targeted NGS of all exons (only) of 90 genes.

Here is what the final BAM file looks like. This is the result without duplicate marking; otherwise there are only a few reads per exon.

[screenshot of coverage in the final BAM]

Adamm
  • Can you clarify what you mean by targeted NGS? What are you targeting/capturing and how were the sequencing libraries prepared? – conchoecia Feb 14 '19 at 06:32
  • I believe that with HaloPlex target capture you expect reads to have identical start and end coordinates, so IMHO in this case you should skip duplicate marking/removal. – Wouter De Coster Feb 14 '19 at 07:43
  • Wouter is right; HaloPlex is PCR-based, so consequently you expect high duplication. Do not deduplicate this data prior to analysis. –  Feb 14 '19 at 12:06
  • Thanks for the help! By the way, what do you think about GATK as a tool for such an analysis? For me it's nice, but a bit too complicated. – Adamm Feb 14 '19 at 12:39
  • Indeed, after MarkDuplicates the coverage was at most 4. It was hard to distinguish actual alterations from mismatched bases. – Adamm Feb 15 '19 at 05:55

2 Answers


If you're doing target capture, especially of a small region, it is entirely possible that there are only five unique reads. Target capture is molecularly finicky and sometimes the entire region is not captured.
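
One rough way to check this on the data is to compare mean depth over the targets with and without duplicate-flagged reads; a sketch with placeholder file names, where targets.bed holds the panel's target regions:

# mean depth over targets including duplicate-flagged reads
samtools depth -a -b targets.bed S1_markdup.bam | awk '{sum += $3; n++} END {if (n) print sum/n}'

# mean depth after dropping duplicate-flagged reads (flag 0x400)
samtools view -b -F 0x400 -o S1_dedup.bam S1_markdup.bam
samtools depth -a -b targets.bed S1_dedup.bam | awk '{sum += $3; n++} END {if (n) print sum/n}'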

conchoecia

Most duplicate-marking programs work by looking at the start and end locations of mapped read pairs. If two read pairs have identical start and end locations for both reads, they will be considered [PCR] duplicates of each other. After running them through a duplicate filter, you will only get one read pair for each unique pair of locations.

This is much more likely for targeted sequencing, because the length covered by the template is much smaller. It's also more likely that you actually do have lots of PCR duplicates for the same reason.
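
As a rough illustration (this approximates the duplicate key; real duplicate markers use unclipped 5' positions and read orientation), you can count how many properly-paired reads share the same chromosome, start and mate start, with a placeholder BAM name:

# for first-in-pair, properly-paired, primary, mapped reads, count identical (chrom, pos, mate pos) keys;
# with amplicon data such as HaloPlex, a handful of keys account for almost all read pairs
samtools view -f 0x42 -F 0x904 S1.bam | awk '{print $3, $4, $8}' | sort | uniq -c | sort -rn | head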

gringer