
We're considering switching our storage format from BAM to CRAM. We work with human cancer samples, which may contain variants at very low allele frequencies (i.e. well below the frequencies expected for diploid germline variants).

If we use lossy CRAM to save more space, how much will variants called from those CRAM files change? Which compression strategy has the lowest impact?

Are there any other impacts on downstream tools that we're not considering?

morgantaschuk
  • CRAM doesn't need to be lossy; is there a reason you need it to be? – Devon Ryan Jun 08 '17 at 14:58
  • Saving disk space. We pay by the GB and need to keep the data around for 10 years. – morgantaschuk Jun 08 '17 at 15:02
  • Can't argue that budget isn't a good reason :) – Devon Ryan Jun 08 '17 at 15:08
  • Interesting question. I think this is the kind of thing that makes a nice side project. Take a BAM file, call the variants, convert it to CRAM, and run the variant caller again. Measure the difference and the variant concordance between the two approaches across a number of different files. – eastafri Jun 08 '17 at 15:16
  • If one is concerned about absolute integrity/reproducibility of the data then budget consideration is not a good reason. – GenoMax Jun 08 '17 at 16:54
  • Unless the answer is: some types of lossy compression don't have any impact on variant calls. – morgantaschuk Jun 08 '17 at 22:08
  • BAM files are zipped with standard gzip compression. Unzip them to "naked BAM" - not my terminology - and re-zip them with something stronger like 7zip/LZMA. You can always re-zip them again with the bgzip tool when you need them back in true BAM format again. This gets you most of the way there filesize-wise without really changing the format, which could be good if you've got things set up how you like. Not an answer because it doesn't answer your question but it might solve your problem. – J.J Jun 08 '17 at 22:32
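
The comparison suggested in the comments (call variants from the BAM, call them again from the CRAM, then measure concordance) could be scored with something like the sketch below. It is only an illustration: the function names `parse_vcf_sites` and `concordance` are made up for this example, the two VCF snippets are toy data rather than real calls, and a real evaluation would also need to compare allele frequencies and filter status, not just site identity.

```python
def parse_vcf_sites(vcf_text):
    """Return the set of (chrom, pos, ref, alt) sites found in VCF-formatted text."""
    sites = set()
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header lines
        chrom, pos, _id, ref, alts = line.split("\t")[:5]
        for alt in alts.split(","):  # split multi-allelic records
            sites.add((chrom, int(pos), ref, alt))
    return sites


def concordance(calls_a, calls_b):
    """Summarise the overlap between two sets of variant sites."""
    shared = calls_a & calls_b
    union = calls_a | calls_b
    return {
        "shared": len(shared),
        "only_a": len(calls_a - calls_b),
        "only_b": len(calls_b - calls_a),
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Toy data (hypothetical): calls made from the original BAM and from a lossy CRAM.
BAM_VCF = (
    "##fileformat=VCFv4.2\n"
    "chr1\t100\t.\tA\tT\n"
    "chr1\t200\t.\tG\tC\n"
    "chr2\t50\t.\tT\tG,A\n"
)
CRAM_VCF = (
    "##fileformat=VCFv4.2\n"
    "chr1\t100\t.\tA\tT\n"
    "chr2\t50\t.\tT\tG\n"
)

result = concordance(parse_vcf_sites(BAM_VCF), parse_vcf_sites(CRAM_VCF))
print(result)
```

For low-frequency somatic variants, the `only_a` set (calls lost after compression) is probably the number to watch most closely.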

2 Answers


By default, a CRAM you create with samtools is lossless. It is typically about half the size of the input BAM. If you want to compress more, you can let samtools convert most read names to integers. You will then no longer be able to identify optical duplicates from read names, but this is a minor concern. You can also drop tags you don't need, depending on your mapper and the downstream caller. For cancer data, I wouldn't reduce the resolution of base quality without comprehensive benchmarks. Unfortunately, base quality takes up most of the space in a CRAM, so discarding the original read names and some tags probably won't save you much.
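
As a rough sketch of what those options look like in practice, the helper below assembles a `samtools view` command line. It assumes a reasonably recent samtools, where `-C` writes CRAM, `-T` names the reference FASTA, `--output-fmt-option lossy_names=1` asks for read names to be discarded where the CRAM writer can do so, and `-x` strips a tag. The helper `cram_command`, the file names, and the tags `BI`/`BD` are placeholders for illustration; which tags are safe to drop depends on your own pipeline.

```python
import shlex


def cram_command(bam, cram, reference, lossy_names=False, drop_tags=()):
    """Build (but do not run) a samtools command converting a BAM to CRAM."""
    cmd = ["samtools", "view", "-C", "-T", reference]
    if lossy_names:
        # Lossy: original read names cannot be recovered afterwards.
        cmd += ["--output-fmt-option", "lossy_names=1"]
    for tag in drop_tags:
        # Lossy: each listed auxiliary tag is removed from every read.
        cmd += ["-x", tag]
    cmd += ["-o", cram, bam]
    return cmd


# Lossless conversion (the default behaviour described above).
lossless = cram_command("sample.bam", "sample.cram", "ref.fa")

# Lossy in names and tags only; bases and qualities are untouched.
lossy = cram_command(
    "sample.bam", "sample.lossy.cram", "ref.fa",
    lossy_names=True, drop_tags=("BI", "BD"),
)

print(shlex.join(lossless))
print(shlex.join(lossy))
```

The commands are only printed here; pass the list to `subprocess.run(cmd, check=True)` to execute one. Keep the reference FASTA archived alongside the CRAM files, since it is needed to decode them.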

user172818
  • These are all great suggestions for reducing file size without losing information, but don't address the main question: the effect of lossiness on variant calls. – Daniel Standage Jun 08 '17 at 16:24
  • @DanielS If you don't touch bases and qualities and name pairing, you won't change the variant calls. – user172818 Jun 08 '17 at 16:27
  • Yes, but then that's not really lossy, is it? Doesn't lossy compression conventionally involve changing the sequence and/or quality values for greater compression efficiency? – Daniel Standage Jun 08 '17 at 16:30
  • That depends on the definition of "lossy" :) To me, losing read names and tags is lossy. – user172818 Jun 08 '17 at 16:41
  • ¯\(ツ)/¯ You yourself said that calls shouldn't change if the sequence and quality are unchanged. So everything else is ancillary. Don't get me wrong, I think it's valuable to point out that it's possible to reduce file size without changing sequence or quality, but it seemed pretty clear to me that the OP was talking about lossy compression of the sequence and/or quality scores. – Daniel Standage Jun 08 '17 at 16:49
  • Then again, this answer is a reasonable response to the question "Which compression strategy has the lowest impact?" Ok, I take it all back! :-) – Daniel Standage Jun 08 '17 at 16:51

The main concern has always been the "binning" of quality scores that occurs via CRAM compression (and is also standard on the HiSeqX, HiSeq4000, and NovaSeq platforms). Anecdotally, I can report very little difference between 4-bin quality scores and full quality scores on cancer samples, though I don't know if I've seen a direct head-to-head comparison.
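
To make "binning" concrete, here is a minimal sketch of collapsing Phred scores into four bins. The bin boundaries in `BIN_EDGES` and the representative values in `BIN_VALUES` are illustrative assumptions, not the exact table used by any sequencer or CRAM tool, and the function names are made up for this example.

```python
import bisect

# Illustrative 4-bin scheme (assumed, not an official table):
# scores below 3 -> 2, 3-14 -> 12, 15-29 -> 23, 30 and above -> 37.
BIN_EDGES = [3, 15, 30]
BIN_VALUES = [2, 12, 23, 37]


def bin_quality(q):
    """Map one Phred score to the representative value of its bin."""
    return BIN_VALUES[bisect.bisect_right(BIN_EDGES, q)]


def bin_quality_string(qual, offset=33):
    """Bin a FASTQ/SAM quality string encoded as Phred+offset ASCII."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qual)


original = "IIII##5?"  # Phred scores 40, 40, 40, 40, 2, 2, 20, 30
binned = bin_quality_string(original)
print(original, "->", binned)
```

After binning, the quality string uses at most four distinct symbols, which is why it compresses so much better; the open question for cancer samples is whether a caller's low-frequency calls shift when, say, Q20 and Q29 become indistinguishable.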

chrisamiller
  • +1. My colleagues have done some benchmarks showing that 4-bin has little effect on germline samples. I have seen similar results. Cancer samples always make me wary, though. It would be great if someone did a systematic evaluation on cancer samples. I haven't seen one so far. – user172818 Jun 09 '17 at 16:32