There's a tool in Picard (full disclaimer: I wrote it...) called CollectIndependentReplicateMetrics which performs the test you are looking for, I think.
It measures the percent of duplication that is "independent", i.e. which arose from randomly sampled original fragments that happen to be duplicates, rather than due to technical (optical, PCR, bridge-amp...) reasons.
The way it works is by looking at duplicate-sets of size 2 (and separately, 3) over heterozygous sites in the bam (which is assumed to be a diploid sample) it classifies the sets into "homogeneous" and "heterogeneous" based on the alleles of the hetrozygous variant. From the relative sizes of these two classes, one can work out the percent of duplication that is "independent" since independent duplication will give rise to heterogeneous sets (sometimes), while dependent duplication will never do that.
The assumptions are quite simple actually:
- a technical duplicate will have the same allele
- an independent replicate will have the same allele 50% of the time
So the rate of independence is 2x the rate of heterozygosity, within duplicate sets of size 2.
For sets of size 3 the math is a little more complex, so I wrote it up and docuemnted it.
Here's the verbatim explanation I wrote at the time (in the source-code of the Metric file...):
Explanation of calculation of independent replication rate from heterozygosity rate (within triplicate sets):
We assume the there are two types of replication events:
- those that are "independent", such that we just happen to get 2 fragments from the exact same region
(These get a random allele so effectively change the allele with probability 0.5), and
- those that are "artifactual" (do not change the allele)
We skip sets that have unexpected alleles as they do not fit our model. In the following we use the term "duplicates" to indicate that
the read-pairs would be marked as duplicates, not that they actually are technical duplicates.
To reach a triplicate set, 2 replication events are required so there are the following options, assuming
that the independent replication rate is x (thus the artifactual is 1-x). We assume a diploid organism with no bias towards either allele
in heterozygous sites, so an idependent replication will result in the other alleles in half the cases:
0 -> 0, 1 happens with probability x/2, therefore
0 -> 0, 0 happens with probability 1-x/2
(This is the explanation that is required for calculating the independent replication rate from doubleton sets...quite simpler)
Without loss of generality we assume that we "start" with allele 0 and that 1 is the other allele.
Each of the resulting alleles ("First" or "second") can replicate (each with probability 0.5) so we get:
from first row:
0=>1 1 (0 replicate to a 1) with probability x/2 * 0.5 * x/2
0=>0 1 (0 replicate to a 0) with probability x/2 * 0.5 * (1-x/2)
=====subtotal = x/4
0 1=>0 (1 replicate to a 0) with probability x/2 * 0.5 * x/2
0 1=>1 (1 replicate to a 1) with probability x/2 * 0.5 * (1-x/2)
=====subtotal = x/4
from second row:
0=>1 0 (first 0 replicate to a 1) with probability (1-x/2) * 0.5 * x/2
0=>0 0 (first 0 replicate to a 0) with probability (1-x/2) * 0.5 * (1-x/2) <======= Homogeneous set
=====subtotal = (1-x/2)/2
0 0=>1 (second 0 replicate to a 1) with probability (1-x/2) * 0.5 * x/2
0 0=>0 (second 0 replicate to a 0) with probability (1-x/2) * 0.5 * (1-x/2) <======= Homogeneous set
=====subtotal = (1-x/2)/2
total is x/2 (from first two sub-totals) + (1-x/2) (from last two sub-totals) = 1
We differentiate between a heterogeneous result (with 0's and 1's in the set) and homogeneous results
(all zeros, since we assumed WLOG that we start with 0)
The probability of a homogeneous set is therefore the sum of the two only homogeneous results
P(hom) = (1-x/2) * 0.5 * (1-x/2) + (1-x/2) * 0.5 * (1-x/2) = (1-x/2)^2 = y
where y = P(hom) is the rate of homogeneity within triplicate sets.
Solving for x we find that 2 * ( 1 - sqrt(y) ) = x.
biSiteHeterogeneityRatein the range 0.000544 to 0.000751 andtriSiteHeterogeneityRate0.001005 to 0.001262. If I'm understanding correctly, does this suggest (if my one sample is representative) that for this particular sequencing process we'd expect something like 0.05%-0.13% false positive duplication rate (so around 1/1000 duplicate reads)? – Ricky Jan 12 '24 at 04:31independentReplicationRateFromTriDupsandindependentReplicationRateFromBiDups. The tell you the estimated fraction of duplication which is not really a duplicate. I.e it came from a different original molecule. – Yossi Farjoun Jan 12 '24 at 16:39