Compressing the human genome to a few megabytes

Question

Multiple sources (see for instance this or this) discuss how genetic data will run into scalability problems because of the huge file size of the human genome. The most straightforward encoding (see here) of the human genome requires about 700 MB.
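To make the size of the straightforward encoding concrete, here is a minimal sketch of a 2-bit-per-base packing. It assumes the sequence contains only A, C, G and T (no N or other symbols) and uses a round figure of 3.1 billion bases for the genome length; the function names are made up for illustration.

```python
# Map each base to a 2-bit code (assumes only A, C, G, T are present).
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        # Left-align a short final chunk so every byte unpacks the same way.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack(data: bytes, length: int) -> str:
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def packed_size_bytes(n_bases: int) -> int:
    """Bytes needed at 2 bits per base, rounded up."""
    return (n_bases * 2 + 7) // 8


GENOME_BASES = 3_100_000_000  # assumed round figure, not an exact count
print(packed_size_bytes(GENOME_BASES) / 1e6, "MB")
```

Under these assumptions the result is 775 MB, the same order of magnitude as the figure quoted above.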

I came across this paper claiming to be able to store the human genome in about 4 MB, given a reference genome and exploiting the fact that all human genomes are mostly identical. This paper is much older than the other references discussing the scalability problems. Why is this technique not widely used?
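The general idea of reference-based storage can be sketched as follows. This is a toy illustration, not the method of the paper: it assumes the sample and the reference have the same length and differ only by single-base substitutions (no insertions or deletions), and the function names are hypothetical.

```python
def make_diff(reference: str, sample: str) -> list[tuple[int, str]]:
    """Record only the positions where the sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("this sketch handles substitutions only")
    return [(i, s) for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


def apply_diff(reference: str, diff: list[tuple[int, str]]) -> str:
    """Rebuild the sample from the reference plus the stored differences."""
    seq = list(reference)
    for pos, base in diff:
        seq[pos] = base
    return "".join(seq)


reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

diff = make_diff(reference, sample)
print(diff)  # [(4, 'T'), (9, 'A')]
assert apply_diff(reference, diff) == sample
```

Only the short list of differences has to be stored per individual, but reconstruction is impossible without access to exactly the same reference.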

https://bioinformatics.stackexchange.com/questions/12824/theoretical-limit-of-human-genome-compression?noredirect=1&lq=1 — Alex Reynolds, Sep 12 '21 at 22:35

score 3 · Answer 1 · answered Jul 14 '21 at 11:08

Answer from @alex-reynolds, converted from comments:

As with any compression scheme, if you have to store a reference genome along with the 4 MB "diff", is it really compressed?

http://mattmahoney.net/dc/dce.html
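One way to put numbers on this question is to count the reference as part of the stored data. The sketch below reuses the approximate figures from the question (about 700 MB for the reference, about 4 MB per diff) and assumes every genome is stored against the same single reference; the function names are hypothetical.

```python
REFERENCE_MB = 700  # approximate figure from the question
DIFF_MB = 4         # approximate figure from the question


def total_storage_mb(n_genomes: int) -> int:
    """Total storage when one shared reference plus one diff per genome is kept."""
    return REFERENCE_MB + n_genomes * DIFF_MB


def per_genome_mb(n_genomes: int) -> float:
    """Average cost per genome, with the reference included."""
    return total_storage_mb(n_genomes) / n_genomes


for n in (1, 10, 1000):
    print(n, total_storage_mb(n), per_genome_mb(n))
```

With these figures a single genome costs 704 MB in total, so nothing is saved, while 1000 genomes cost 4700 MB, or 4.7 MB each. The saving only appears when many genomes share one reference.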

For sequencing data, you might also look at the CRAM format, which also uses diffs against a reference genome to store reduced datasets. Labs are switching from BAM to CRAM to manage storage of increased numbers of sequencing datasets, with associated changes to pipelines and toolkits to use this format. For performance reasons, BAM may be preferred for analysis, and CRAM for long-term storage.

https://gatk.broadinstitute.org/hc/en-us/articles/360035890791-SAM-or-BAM-or-CRAM-Mapped-sequence-data-formats

https://www.ga4gh.org/news/cram-compression-for-genomics/

score 1 · Answer 2 · answered Aug 13 '21 at 13:35

I saw this paper a few weeks back that claims to offer 4x more compression than typical formats for both FASTQ and BAM/CRAM files.

Genozip - A Universal Extensible Genomic Data Compressor

https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab102/6135077
