How many false positive duplicates are marked using just the position of first unclipped base?

Question

In the popular picard MarkDuplicates tool, a read is marked as a duplicate if it has the same position as another read starting from their first unclipped base in the 5'-direction. The implication is that if two reads have this property, it must be much more likely they arose from some "artificial" duplicates (PCR, optical, etc.) rather than due to sheer coincidence that two independent fragments of DNA happened to have this property and were sequenced in the same run.

Does anyone have links to studies or other data that attempts to measure how often this mechanism for labeling duplicates fails? I.e. how frequently does a read get marked a duplicate in this way "incorrectly," meaning it was actually just a coincidence and really did correspond to an independent fragment that should be considered in your analysis.

I imagine this answer can vary quite a bit based on the sequencing conditions: library prep, machine, etc. I was hoping for at least a ballpark estimate in the case of some "standard" modern short read sequencing methods. But it should be possible to study this in particular cases, e.g. using UMIs, doing PCR vs non-PCR runs, etc.

Yossi Farjoun · Accepted Answer · 2024-01-12T16:36:00.100

There's a tool in Picard (full disclaimer: I wrote it...) called CollectIndependentReplicateMetrics which performs the test you are looking for, I think.

It measures the percent of duplication that is "independent", i.e. which arose from randomly sampled original fragments that happen to be duplicates, rather than due to technical (optical, PCR, bridge-amp...) reasons.

The way it works is by looking at duplicate-sets of size 2 (and separately, 3) over heterozygous sites in the bam (which is assumed to be a diploid sample) it classifies the sets into "homogeneous" and "heterogeneous" based on the alleles of the hetrozygous variant. From the relative sizes of these two classes, one can work out the percent of duplication that is "independent" since independent duplication will give rise to heterogeneous sets (sometimes), while dependent duplication will never do that.

The assumptions are quite simple actually:

a technical duplicate will have the same allele
an independent replicate will have the same allele 50% of the time

So the rate of independence is 2x the rate of heterozygosity, within duplicate sets of size 2.

For sets of size 3 the math is a little more complex, so I wrote it up and docuemnted it.

Here's the verbatim explanation I wrote at the time (in the source-code of the Metric file...):

Explanation of calculation of independent replication rate from heterozygosity rate (within triplicate sets):
We assume the there are two types of replication events:

those that are "independent", such that we just happen to get 2 fragments from the exact same region

(These get a random allele so effectively change the allele with probability 0.5), and

those that are "artifactual" (do not change the allele)

We skip sets that have unexpected alleles as they do not fit our model. In the following we use the term "duplicates" to indicate that
the read-pairs would be marked as duplicates, not that they actually are technical duplicates.
To reach a triplicate set, 2 replication events are required so there are the following options, assuming
that the independent replication rate is x (thus the artifactual is 1-x). We assume a diploid organism with no bias towards either allele
in heterozygous sites, so an idependent replication will result in the other alleles in half the cases:
0 -> 0, 1 happens with probability x/2, therefore
0 -> 0, 0 happens with probability 1-x/2
(This is the explanation that is required for calculating the independent replication rate from doubleton sets...quite simpler)
Without loss of generality we assume that we "start" with allele 0 and that 1 is the other allele.
Each of the resulting alleles ("First" or "second") can replicate (each with probability 0.5) so we get:
from first row:
0=>1 1 (0 replicate to a 1) with probability  x/2 * 0.5 * x/2
0=>0 1 (0 replicate to a 0) with probability  x/2 * 0.5 * (1-x/2)
=====subtotal = x/4
0 1=>0 (1 replicate to a 0) with probability x/2 * 0.5 * x/2
0 1=>1 (1 replicate to a 1) with probability x/2 * 0.5 * (1-x/2)
=====subtotal = x/4
from second row:
0=>1 0 (first 0 replicate to a 1) with probability (1-x/2) * 0.5 * x/2
0=>0 0 (first 0 replicate to a 0) with probability (1-x/2) * 0.5 * (1-x/2)   <======= Homogeneous set
=====subtotal = (1-x/2)/2
0 0=>1 (second 0 replicate to a 1) with probability (1-x/2) * 0.5 * x/2
0 0=>0 (second 0 replicate to a 0) with probability (1-x/2) * 0.5 * (1-x/2)  <======= Homogeneous set
=====subtotal = (1-x/2)/2
total is x/2 (from first two sub-totals) + (1-x/2) (from last two sub-totals) = 1
We differentiate between a heterogeneous result (with 0's and 1's in the set) and homogeneous results
(all zeros, since we assumed WLOG that we start with 0)
The probability of a homogeneous set is therefore the sum of the two only homogeneous results
P(hom) = (1-x/2) * 0.5 * (1-x/2) + (1-x/2) * 0.5 * (1-x/2) = (1-x/2)^2 = y
where y = P(hom) is the rate of homogeneity within triplicate sets.
Solving for x we find that 2 * ( 1 - sqrt(y) ) = x.

I should also mention that this tool is based on the DuplicateSetIterator which is a really convenient tool (from htsjdk) for exploring duplication. — Yossi Farjoun, Jan 10 '24 at 21:22
Hi Yossi, thanks for this! I just tried running this tool on a public sample using an Illumina 30x bam we have on a couple random chromosomes (14, 18, 20; 3 segfaulted for some reason). I ended up seeing biSiteHeterogeneityRate in the range 0.000544 to 0.000751 and triSiteHeterogeneityRate 0.001005 to 0.001262. If I'm understanding correctly, does this suggest (if my one sample is representative) that for this particular sequencing process we'd expect something like 0.05%-0.13% false positive duplication rate (so around 1/1000 duplicate reads)? — Ricky, Jan 12 '24 at 04:31
I'm assuming the tool has some built-in prior distribution that genuine duplicate-sets of size 2 at a het site should be 100% either ref or hom, but then tracks the rate at which the duplicates disagree at those sites? Does the tool have some other fancy techniques for trying to disambiguate the cases where you by chance have false positive duplicates but they both agree on the allele? I suppose one could take the "FP rate" from hets and extrapolate a similar rate to these sites and estimate some sort of overall FP duplication rate... — Ricky, Jan 12 '24 at 04:34
I documented the "model" in the tool itself...let me add to my answer. — Yossi Farjoun, Jan 12 '24 at 16:30
@Ricky the numbers in the metric file that yoyu need are independentReplicationRateFromTriDups and independentReplicationRateFromBiDups. The tell you the estimated fraction of duplication which is not really a duplicate. I.e it came from a different original molecule. — Yossi Farjoun, Jan 12 '24 at 16:39
As to whether one can recover the false positives, sure...but only over heterozygous sites, and only about half of them, as they would have to have a different allele from their sister reads. This also means that you already know what the genotypes are...so the value of extracting these reads is dubious.... — Yossi Farjoun, Jan 12 '24 at 16:41
The "heterogeneity" metric only tells you what fraction of the duplicate sets of size 2 & 3 were "heterozygous", i.e. didn't have reads that agreed on the allele. it's a step in the direction of getting the independent replication rate, but not the whole story (see explanation in the answer) — Yossi Farjoun, Jan 12 '24 at 16:43
Thanks for clarifying @Yossi and modifying your answer. This is great. I'll have to think through the stats a bit more to get a final estimate on the FP rate from what I have (after I'm done traveling), but this tool is a nice convenience to have when thinking about this problem. — Ricky, Jan 17 '24 at 09:24
Sure thing @Ricky, and if you have any ideas of how to make the tool more accessible, or useful, I'm all ears.... — Yossi Farjoun, Jan 25 '24 at 20:12

How many false positive duplicates are marked using just the position of first unclipped base?

1 Answers1