Converting VCF format to text for use with PLINK and understanding column mapping

Question

I successfully completed Nature PRS tutorial, which is based on PLINK. Turning to my real data, I downloaded ukb-d-20544_1.vcf.gz. Now I'm facing the problem that I seem to be unable to use it in PLINK or find the correct data format to download at all, and I am a bit confused about the correct general summary statistics-download procedure.

1. Convert .vcf format to a .txt file, PLINK can understand

In the tutorial they use a .txt.gz base data file aka summary statistics that looks like this:

CHR BP  SNP A1  A2  N   SE  P   OR  INFO    MAF
1   756604  rs3131962   A   G   388028  0.00301666  0.483171    0.997886915712657   0.890557941364774   0.369389592764921
1   768448  rs12562034  A   G   388028  0.00329472  0.834808    1.00068731609353    0.895893511351165   0.336845754096289
...

Having this .vcf.gz (please see excerpt at the very bottom), I unzipped and, in R, used MungeSumstats::format_sumstats("ukb-d-20544_1.vcf", ref_genome = "GRCh37")[1], to get something out of it similar to what they use in the .txt file in the tutorial. However, the column names won't match exactly. I know that I can use BETA instead of OR, and FRQ might correspond to MAF, but I can't seem to find the imputation information score INFO. Here is what MungeSumstats::format_sumstats produced:

SNP CHR BP  A1  A2  END FILTER  FRQ BETA    SE  LP  N   P
rs12238997  1   693731  A   G   693731  PASS    0.116734    0.00058531  0.000756397 0.357493    117706  0.439042942120051
rs371890604 1   707522  G   C   707522  PASS    0.0983981   0.000793177 0.000848559 0.456023    117706  0.349926634613021
...

I need to have the columns corresponding to MAF and INFO as per the tutorial.

The next step in the tutorial involves filtering based on specific criteria, such as MAF and INFO values. However, I can't proceed because these columns are missing.

I would like to primarily use the terminal and only turn to R for statistical analysis. Could you guide me on the correct approach to convert the .vcf file to .txt so that I can continue following the tutorial with PLINK?

2. General approach

In addition to addressing the conversion issue, I'm also seeking guidance on the appropriate data sources for genetic analysis. I'm curious about whether OpenGWAS is a reliable source for publication or if there are other recommended sources like dbGaP. Additionally, what is the ideal data format to obtain?

The beginning of the received .vcf file looks like this:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=ES,Number=A,Type=Float,Description="Effect size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=A,Type=Float,Description="Standard error of effect size estimate">
##FORMAT=<ID=LP,Number=A,Type=Float,Description="-log10 p-value for effect estimate">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Alternate allele frequency in the association study">
##FORMAT=<ID=SS,Number=A,Type=Float,Description="Sample size used to estimate genetic effect">
##FORMAT=<ID=EZ,Number=A,Type=Float,Description="Z-score provided if it was used to derive the EFFECT and SE fields">
##FORMAT=<ID=SI,Number=A,Type=Float,Description="Accuracy score of summary data imputation">
##FORMAT=<ID=NC,Number=A,Type=Float,Description="Number of cases used to estimate genetic effect">
##FORMAT=<ID=ID,Number=1,Type=String,Description="Study variant identifier">
##META=<ID=TotalVariants,Number=1,Type=Integer,Description="Total number of variants in input">
##META=<ID=VariantsNotRead,Number=1,Type=Integer,Description="Number of variants that could not be read">
##META=<ID=HarmonisedVariants,Number=1,Type=Integer,Description="Total number of harmonised variants">
##META=<ID=VariantsNotHarmonised,Number=1,Type=Integer,Description="Total number of variants that could not be harmonised">
##META=<ID=SwitchedAlleles,Number=1,Type=Integer,Description="Total number of variants strand switched">
##META=<ID=TotalControls,Number=1,Type=Integer,Description="Total number of controls in the association study">
##META=<ID=TotalCases,Number=1,Type=Integer,Description="Total number of cases in the association study">
##META=<ID=StudyType,Number=1,Type=String,Description="Type of GWAS study [Continuous or CaseControl]">
##SAMPLE=<ID=ukb-d-20544_1,TotalVariants=9938650,VariantsNotRead=0,HarmonisedVariants=9938650,VariantsNotHarmonised=0,SwitchedAlleles=209,StudyType=Continuous>
##contig=<ID=1,length=249250621,assembly=HG19/GRCh37>
##contig=<ID=2,length=243199373,assembly=HG19/GRCh37>
...
##contig=<ID=22,length=51304566,assembly=HG19/GRCh37>
##contig=<ID=X,length=155270560,assembly=HG19/GRCh37>
##contig=<ID=Y,length=59373566,assembly=HG19/GRCh37>
##contig=<ID=MT,length=16569,assembly=HG19/GRCh37>
##contig=<ID=GL000207.1,length=4262,assembly=HG19/GRCh37>
##contig=<ID=GL000226.1,length=15008,assembly=HG19/GRCh37>
...
##contig=<ID=GL000192.1,length=547496,assembly=HG19/GRCh37>
##gwas_harmonisation_command=--json /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/ukb-d-import/processed/ukb-d-20544_1/ukb-d-20544_1_data.json --ref /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/QC/genomes/b37/human_g1k_v37.fasta; 1.1.1
##file_date=2019-11-25T15:12:05.258952
##bcftools_annotateVersion=1.9-74-g6af271c+htslib-1.9-64-g226b4a8
##bcftools_annotateCommand=annotate -a /mnt/storage/home/gh13047/mr-eve/vcf-reference-datasets/dbsnp/dbsnp.v153.b37.vcf.gz -c ID -o /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/ukb-d-import/processed/ukb-d-20544_1/ukb-d-20544_1.vcf.gz -O z /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/ukb-d-import/processed/ukb-d-20544_1/ukb-d-20544_1_data.vcf.gz; Date=Mon Nov 25 15:37:00 2019
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  ukb-d-20544_1
1       692794  rs530212009     CA      C       .       PASS    AF=0.111693     ES:SE:LP:AF:SS:ID       0.000379398:0.000798869:0.197332:0.111693:117706:1_692794_CA_C
1       693731  rs12238997      A       G       .       PASS    AF=0.116734     ES:SE:LP:AF:SS:ID       0.00058531:0.000756397:0.357493:0.116734:117706:rs12238997

score 2 · Answer 1 · answered Oct 13 '23 at 08:18

For the first part of question 2: if there are other recommended sources like dbGaP.

Please refer to this Databases of GWAS summary statistics:

Database	Content
GWAS Catalog	GWAS summary statistics and GWAS lead SNPs reported in GWAS papers
GeneAtlas	UK Biobank GWAS summary statistics
Pan UKBB	UK Biobank GWAS summary statistics
GWAS Atlas	Collection of publicly available GWAS summary statistics with follow-up in silico analysis
FinnGen results	GWAS summary statistics released from FinnGen, a project that collected biological samples from many sources in Finland
Pheweb.jp	GWAS summary statistics of Biobank Japan and cross-population meta-analyses

https://www.nature.com/articles/s43586-021-00056-9/tables/3

Thanks for pointing to that paper, very helpful +1 – jay.sf Oct 13 '23 at 20:03 — jay.sf, Oct 13 '23 at 20:03

score 2 · Answer 2 · answered Oct 18 '23 at 17:01

However, the column names won't match exactly. I know that I can use BETA instead of OR, and FRQ might correspond to MAF, but I can't seem to find the imputation information score INFO.

Yes. There is not a standard format that everyone reports GWAS summary statistics in, unfortunately. First, check the paper that you are sourcing it from to see if they already filtered on imputation quality. Do ctrl+f and look in the methods section for imputation information (you may also try searching for imputation tools, like minimac or Beagle).

The good news is that if the sumstats have many alleles, imputation accuracy doesn't matter much. Errors get averaged out and each error only has a small effect since there are many alleles. Usually it is expected that reported sumstats will be pre-QC'd in some way, especially if they lack INFO. It would be safe to proceed without the INFO filter step.

The next step in the tutorial involves filtering based on specific criteria, such as MAF and INFO values. However, I can't proceed because these columns are missing.

Sumstat formatting is whatever the author wants, and can be a bit annoying to parse. However, FRQ is basically MAF.

See here (ctrl+f "frq"): https://bioconductor.riken.jp/packages/3.14/bioc/vignettes/MungeSumstats/inst/doc/MungeSumstats.html https://academic.oup.com/bioinformatics/article/37/23/4593/6380562

I would like to primarily use the terminal and only turn to R for statistical analysis. Could you guide me on the correct approach to convert the .vcf file to .txt so that I can continue following the tutorial with PLINK?

Did you try unzipping and simply replacing .vcf with .txt in the filename?

General approach In addition to addressing the conversion issue, I'm also seeking guidance on the appropriate data sources for genetic analysis. I'm curious about whether OpenGWAS is a reliable source for publication or if there are other recommended sources like dbGaP. Additionally, what is the ideal data format to obtain?

This aligns with your first question. If you use a database, sumstats will be standardized (usually) to the database's common format, so there will be less confusion over inconsistent column names. The GWAS catalog pretty good in my opinion, and all of them are acceptable.

Converting VCF format to text for use with PLINK and understanding column mapping

1. Convert .vcf format to a .txt file, PLINK can understand

2. General approach

2 Answers2