3

I am working on a project that requires benchmarking the performance of computational tools for alignment and variant calling of human genome sequencing data. In particular, I would like to establish benchmarks comparing the performance of GATK and ADAM/Avocado.

My main question is which genomic datasets I should use as input to the data processing pipeline. In his thesis, Frank Nothaft (author of ADAM and Avocado) reports benchmarking the processing of NA12878 from the 1000 Genomes Project in Avocado against GATK's HaplotypeCaller. He also reports benchmarking on 270 samples from the Simons Genome Diversity Dataset.

I am interested in reproducing this portion of his work, but I do not know where to get hold of these datasets. Where can I get a copy of the NA12878 genome (BAM file), and where can I get aligned samples from the Simons Genome Diversity Dataset?

Jon Deaton
  • 399
  • 2
  • 10

1 Answer

2

NA12878

  • My recommendation: download the raw reads from Illumina BaseSpace and do the alignment yourself. Aligning ~40X of human reads takes roughly overnight on ~16 CPU cores.

  • You can get the Platinum Genomes data from ENA. A BAM is available for download, but it was produced by Illumina's own aligner against the wrong reference genome, so you had better do the alignment yourself. Another problem with this dataset is that it is too good to be representative. Modern data are worse.

  • GIAB provides BAMs here:

    ftp://ftp-trace.ncbi.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/
    

    It is not "typical", either.
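If you do the alignment yourself, the step can be scripted. Below is a minimal sketch, not a definitive pipeline. It assumes `bwa mem` piped into `samtools sort` as the aligner and sorter (the answer above does not name an aligner), and every file name is a hypothetical placeholder. The reference is assumed to have been indexed beforehand with `bwa index`.

```python
import shlex


def build_alignment_pipeline(ref_fasta, fastq_r1, fastq_r2, out_bam,
                             sample="NA12878", threads=16):
    """Return (bwa_cmd, sort_cmd) argument lists for `bwa mem | samtools sort`.

    Assumption: bwa and samtools are installed and ref_fasta is bwa-indexed.
    """
    # bwa expects a literal backslash-t in the read group and expands it itself.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    bwa_cmd = ["bwa", "mem", "-t", str(threads), "-R", read_group,
               ref_fasta, fastq_r1, fastq_r2]
    # "-" makes samtools sort read the alignments from standard input.
    sort_cmd = ["samtools", "sort", "-@", str(threads), "-o", out_bam, "-"]
    return bwa_cmd, sort_cmd


def as_shell(bwa_cmd, sort_cmd):
    """Render the two commands as one shell pipeline string."""
    return f"{shlex.join(bwa_cmd)} | {shlex.join(sort_cmd)}"


# Hypothetical file names, for illustration only.
bwa_cmd, sort_cmd = build_alignment_pipeline(
    "reference.fa", "NA12878_R1.fastq.gz", "NA12878_R2.fastq.gz",
    "NA12878.sorted.bam")
print(as_shell(bwa_cmd, sort_cmd))
```

The script only prints the pipeline, so you can inspect it before running it in a shell or handing the two argument lists to `subprocess.Popen`.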

SGDP

  • Instructions on downloading SGDP BAMs are available on this page, in particular this Word document. Warning: you need ~30TB of disk space for these BAMs. I have also heard that the data are tricky to download.

  • There is apparently a mirror on the Cancer Genomics Cloud. I don't know how easy it is to access.
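Before starting the SGDP download, it may help to check the storage and transfer budget. This is a rough back-of-the-envelope sketch. It assumes the ~30TB figure is in decimal terabytes and covers roughly the 270 samples mentioned in the question (it may cover more), and the 1 Gbit/s link speed is purely hypothetical.

```python
import shutil

SGDP_TOTAL_BYTES = 30e12  # ~30 TB quoted above; decimal terabytes assumed
N_SAMPLES = 270           # sample count from the question; assumed to match


def per_sample_gb(total_bytes=SGDP_TOTAL_BYTES, n_samples=N_SAMPLES):
    """Average BAM size in GB if the total is spread evenly over the samples."""
    return total_bytes / n_samples / 1e9


def transfer_days(total_bytes, bits_per_second):
    """Days needed to move total_bytes over a link sustaining bits_per_second."""
    return total_bytes * 8 / bits_per_second / 86400


def has_room(path, needed_bytes=SGDP_TOTAL_BYTES):
    """True if the filesystem holding `path` has at least needed_bytes free."""
    return shutil.disk_usage(path).free >= needed_bytes


print(f"average BAM size: {per_sample_gb():.0f} GB")
# Hypothetical sustained rate of 1 Gbit/s.
print(f"transfer time: {transfer_days(SGDP_TOTAL_BYTES, 1e9):.1f} days")
print(f"enough free space here: {has_room('.')}")
```

Real throughput from a public archive is usually well below the nominal link speed, so treat the transfer time as a lower bound.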

user172818
  • 6,515
  • 2
  • 13
  • 29
  • I'm having some trouble finding the fastq files for NA12878 on Illumina BaseSpace. Do you have any suggestions on where to find these? – Jon Deaton Jun 25 '18 at 17:40