Some programs (e.g. shapeit4) automatically add an INFO tag to the .vcf file giving the cumulative genetic distance in cM at each SNP:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=10/11/2020 - 21:29:39
##source=G2H
##contig=<ID=9>
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AC,Number=1,Type=Integer,Description="Allele count">
##INFO=<ID=CM,Number=A,Type=Float,Description="Interpolated cM position">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotypes">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##bcftools_viewVersion=1.9-207-g2299ab6+htslib-1.9-271-g6738132
##bcftools_viewCommand=view -s -1_-1 ukb_cal_chr9_v2.HumanOrigins.common_snps.HumanOriginsInds.fiftyPercentEthnicUKBInds.PHASED.bgz.vcf.gz; Date=Mon May 24 14:15:51 2021
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  -1_-1
9   257034  rs573860    A   G   .   .   AC=0;AF=0.388699;CM=0;AN=2  GT  0|0
9   270321  rs17720310  A   G   .   .   AC=0;AF=0.0820595;CM=0.007994;AN=2  GT  0|0

However, shapeit4 is really meant for phasing, not for .vcf annotation.

Ideally I'd like a simple command-line program that takes a .vcf and a genetic map as input, interpolates the genetic position of each variant, and adds it to the .vcf as a cumulative cM value. This could be done manually with a custom script, but quite frankly I can't be bothered to do that.
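For reference, the manual route amounts to linear interpolation of the map's cM values at each variant position. Below is a minimal sketch of that idea, not a tested tool. It assumes a single-chromosome, tab-delimited VCF and a genetic map sorted by position with three whitespace-separated columns (pos, chr, cM); that column layout is an assumption and should be adjusted to the actual map file. The function names are hypothetical, and positions outside the map are clamped to the nearest map value.

```python
import bisect


def load_genetic_map(lines):
    """Parse map lines. ASSUMED columns: pos, chr, cM (whitespace separated)."""
    positions, cms = [], []
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or not fields[0].isdigit():
            continue  # skip header and blank lines
        positions.append(int(fields[0]))
        cms.append(float(fields[2]))
    return positions, cms


def interpolate_cm(pos, positions, cms):
    """Linearly interpolate cM at pos; clamp to the map ends outside its range."""
    if pos <= positions[0]:
        return cms[0]
    if pos >= positions[-1]:
        return cms[-1]
    i = bisect.bisect_right(positions, pos)  # positions[i-1] <= pos < positions[i]
    x0, x1 = positions[i - 1], positions[i]
    y0, y1 = cms[i - 1], cms[i]
    return y0 + (y1 - y0) * (pos - x0) / (x1 - x0)


def annotate_vcf(vcf_lines, positions, cms):
    """Yield VCF lines with CM=<value> appended to the INFO column."""
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#CHROM"):
            # Declare the new tag just before the column header line.
            yield '##INFO=<ID=CM,Number=1,Type=Float,Description="Interpolated cM position">'
            yield line
        elif line.startswith("#") or not line:
            yield line
        else:
            fields = line.split("\t")
            cm = interpolate_cm(int(fields[1]), positions, cms)
            tag = f"CM={cm:.6g}"
            fields[7] = tag if fields[7] in (".", "") else f"{fields[7]};{tag}"
            yield "\t".join(fields)
```

To use it, read the map with `load_genetic_map(open(map_path))` and write out each line produced by `annotate_vcf(open(vcf_path), positions, cms)`; `map_path` and `vcf_path` are placeholders. The sketch does not check whether a CM tag or header line already exists, and it does not handle bgzipped input or multiple chromosomes.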

Does anyone know of an existing tool that will do this?

I posted this to biostars a while ago and received one reply, which linked to this post; it includes links to tables containing cM locations for different rs numbers.

EDIT: I am interested in doing this for human data (although the same would be the case for any diploid organism).

Sample vcf:

https://pastebin.com/raw/YSbuU0qM

user438383

1 Answer

I think a useful tool for this is QCTOOL v2. In particular, the option you are looking for is -annotate-genetic-map. The tool was originally developed for GWAS datasets and supports multiple input data formats.

If you can provide some example input data, I will try working through it and post the necessary steps.

Hope this answers your question.

user324810
  • Sorry took a while, have added the sample vcf – user438383 May 31 '21 at 15:07
  • @user438383 could you please specify whether your VCF is hg38 or hg19? Also, please provide a properly formatted VCF: the file is missing some columns, such as FORMAT followed by a sample column, as in the example in your original post. When I tried the file you provided, it gave a segmentation fault. When I added a VCF header and the columns mentioned above, it worked. – user324810 Jun 01 '21 at 09:44