I want to use GATK to estimate cross-sample contamination in whole-genome sequencing data.

The specific tool is ContEst and it is run with:

java \
 -jar GenomeAnalysisTK.jar \
 -T ContEst \
 -R reference.fasta \
 -I:eval tumor.bam \
 -I:genotype normal.bam \
 --popfile populationAlleleFrequencies.vcf \
 -L populationSites.interval_list \
 [-L targets.interval_list] \
 -isr INTERSECTION \
 -o output.txt

The --popfile option requires a VCF file of population allele frequencies. Where can I get one? Is it the file distributed by dbSNP?

I'm using the human_g1k_v37.fasta reference, a version of hg19, which is described as follows: "for the phase1 analysis we mapped to GRCh37. Our fasta file which can be found here called human_g1k_v37.fasta.gz, it contains the autosomes, X, Y and MT but no haplotype sequence or EBV".

Daniel Standage
gc5

1 Answer

On the GATK forum, the population-stratified VCF file has been recommended for use as the --popfile input.
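Whichever VCF you end up using, it is worth checking that its contig names match your reference, since human_g1k_v37.fasta uses unprefixed names ("1", "MT") while UCSC-style hg19 files use "chr1", "chrM". The following is only a minimal sketch of such a check, not part of GATK: the helper names (vcf_contigs, fai_contigs, missing_from_reference) are made up for illustration, and it assumes you have a samtools-style .fai index next to the reference.

```python
import gzip


def vcf_contigs(path):
    """Collect contig names from the first column of a VCF's data lines."""
    # Assumption: plain-text or gzip/bgzip-compressed VCF, tab-delimited.
    opener = gzip.open if path.endswith(".gz") else open
    names = set()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header lines and blank lines
            names.add(line.split("\t", 1)[0])
    return names


def fai_contigs(path):
    """Collect contig names from a FASTA index (.fai); the name is column 1."""
    with open(path) as handle:
        return {line.split("\t", 1)[0] for line in handle if line.strip()}


def missing_from_reference(vcf_path, fai_path):
    """Return the sorted VCF contig names that the reference does not contain."""
    return sorted(vcf_contigs(vcf_path) - fai_contigs(fai_path))
```

For example, missing_from_reference("populationAlleleFrequencies.vcf", "human_g1k_v37.fasta.fai") should return an empty list if every contig in the VCF exists in the reference; a list full of "chr"-prefixed names would suggest the VCF was built against a UCSC-style hg19 reference instead.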

Devon Ryan