I am running Picard MarkDuplicates with the parameters below. On the file described, it takes about 41.6 GB of RAM and about 20-25 minutes to compute (it only uses one core, as far as I can see).

java -Xmx54890m -jar picard.jar MarkDuplicates \
    I=/home/dnanexus/in/sorted_bam/TST73-86-2IC_S16_L00.bml.GRCh38.karyo.bam \
    O=./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.deduplicated.bam \
    M=./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.duplication_metrics \
    CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true

What equivalent free/open-source software (samtools? sambamba?) can I use to obtain the same deduplicated BAM file at a lower cost? The instance used here has 8 cores, but only one of them is being used, and it has about 68 GB of RAM, of which roughly 46 GB is used at peak.

I think previous versions of samtools/sambamba didn't behave the same way as Picard MarkDuplicates with the parameters above, but I would like to know whether recent versions would give me the same output.

gringer
719016

1 Answer

As you suggested, sambamba is faster at marking duplicates than Picard (it's also multithreaded). Recent versions of samtools have a rewritten duplicate-marking algorithm, though I doubt it'll be as quick as sambamba. 46 GB of RAM seems excessive for marking duplicates unless you're having it store the whole file in memory.
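For reference, the equivalent invocations would look roughly like the sketch below (assuming a reasonably recent sambamba and samtools ≥ 1.8, which introduced the rewritten `markdup`; check your versions' documentation, and note that neither produces a Picard-style duplication-metrics file by default). Filenames are the ones from the question:

```shell
# sambamba: multithreaded; -r removes duplicates rather than just marking them
sambamba markdup -r -t 8 \
    /home/dnanexus/in/sorted_bam/TST73-86-2IC_S16_L00.bml.GRCh38.karyo.bam \
    ./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.deduplicated.bam

# samtools: markdup needs mate scores added by fixmate, which in turn needs
# name-collated input, so the usual pipeline is collate -> fixmate -> sort -> markdup
samtools collate -@ 8 -O \
    /home/dnanexus/in/sorted_bam/TST73-86-2IC_S16_L00.bml.GRCh38.karyo.bam - \
  | samtools fixmate -@ 8 -m - - \
  | samtools sort -@ 8 - \
  | samtools markdup -@ 8 -r - \
    ./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.deduplicated.bam
```

Both approaches use all 8 cores and stream the data rather than holding the whole file in memory, which should bring the peak RAM usage well below what Picard needs.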

Devon Ryan
  • Does sambamba with default parameters equate to Picard MarkDuplicates? – 719016 Dec 12 '17 at 21:50
  • I'm not sure that any tool exactly equates (there are a lot of edge cases in duplicate marking and picard misses some things), but it's a sufficient replacement. Sambamba is also used for this purpose by a number of international consortia (I think either DEEP or IHEC, if I remember correctly). – Devon Ryan Dec 12 '17 at 23:55