I am running Picard MarkDuplicates with the parameters below. On the file described, it takes about 41.6 GB of RAM and about 20-25 minutes to compute (it only uses one core, as far as I can see).

java -Xmx54890m -jar picard.jar MarkDuplicates \
    I=/home/dnanexus/in/sorted_bam/TST73-86-2IC_S16_L00.bml.GRCh38.karyo.bam \
    O=./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.deduplicated.bam \
    M=./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.duplication_metrics \
    CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT REMOVE_DUPLICATES=true

What equivalent free/open-source software (samtools? sambamba?) can I use to obtain the same deduplicated BAM file at a lower cost? The instance used here has 8 cores, but only one of them is being used, and it has about 68 GB of RAM, of which roughly 46 GB is used at peak.

I think previous versions of samtools/sambamba didn't behave the same way as Picard MarkDuplicates with the parameters above, but I would like to know whether recent versions would give me the same output.

gringer
719016

1 Answer

As you suggested, sambamba is faster at marking duplicates than Picard (it's also multithreaded). Recent versions of samtools have a rewritten duplicate-marking algorithm, though I doubt it'll be as quick as sambamba. 46 GB of RAM seems excessive for marking duplicates unless you're having it store the whole file in memory.
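For reference, the equivalent invocations would look roughly like the sketch below (assuming a reasonably recent sambamba and samtools ≥ 1.8, which introduced the rewritten `markdup`; check your versions' documentation, and note that neither produces a Picard-style duplication-metrics file by default). Filenames are the ones from the question:

```shell
# sambamba: multithreaded; -r removes duplicates rather than just marking them
sambamba markdup -r -t 8 \
    /home/dnanexus/in/sorted_bam/TST73-86-2IC_S16_L00.bml.GRCh38.karyo.bam \
    ./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.deduplicated.bam

# samtools: markdup needs mate scores added by fixmate, which in turn needs
# name-collated input, so the usual pipeline is collate -> fixmate -> sort -> markdup
samtools collate -@ 8 -O \
    /home/dnanexus/in/sorted_bam/TST73-86-2IC_S16_L00.bml.GRCh38.karyo.bam - \
  | samtools fixmate -@ 8 -m - - \
  | samtools sort -@ 8 - \
  | samtools markdup -@ 8 -r - \
    ./TST73-86-2IC_S16_L00.bml.GRCh38.karyo.deduplicated.bam
```

Both approaches use all 8 cores and stream the data rather than holding the whole file in memory, which should bring the peak RAM usage well below what Picard needs.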

Devon Ryan
  • Does sambamba with default parameters equate to Picard MarkDuplicates? – 719016 Dec 12 '17 at 21:50
  • I'm not sure that any tool exactly equates (there are a lot of edge cases in duplicate marking and picard misses some things), but it's a sufficient replacement. Sambamba is also used for this purpose by a number of international consortia (I think either DEEP or IHEC, if I remember correctly). – Devon Ryan Dec 12 '17 at 23:55