I successfully completed Nature PRS tutorial, which is based on PLINK. Turning to my real data, I downloaded ukb-d-20544_1.vcf.gz. Now I'm facing the problem that I seem to be unable to use it in PLINK or find the correct data format to download at all, and I am a bit confused about the correct general summary statistics-download procedure.
1. Convert .vcf format to a .txt file, PLINK can understand
In the tutorial they use a .txt.gz base data file aka summary statistics that looks like this:
CHR BP SNP A1 A2 N SE P OR INFO MAF
1 756604 rs3131962 A G 388028 0.00301666 0.483171 0.997886915712657 0.890557941364774 0.369389592764921
1 768448 rs12562034 A G 388028 0.00329472 0.834808 1.00068731609353 0.895893511351165 0.336845754096289
...
Having this .vcf.gz (please see excerpt at the very bottom), I unzipped and, in R, used MungeSumstats::format_sumstats("ukb-d-20544_1.vcf", ref_genome = "GRCh37")[1], to get something out of it similar to what they use in the .txt file in the tutorial. However, the column names won't match exactly. I know that I can use BETA instead of OR, and FRQ might correspond to MAF, but I can't seem to find the imputation information score INFO. Here is what MungeSumstats::format_sumstats produced:
SNP CHR BP A1 A2 END FILTER FRQ BETA SE LP N P
rs12238997 1 693731 A G 693731 PASS 0.116734 0.00058531 0.000756397 0.357493 117706 0.439042942120051
rs371890604 1 707522 G C 707522 PASS 0.0983981 0.000793177 0.000848559 0.456023 117706 0.349926634613021
...
I need to have the columns corresponding to MAF and INFO as per the tutorial.
The next step in the tutorial involves filtering based on specific criteria, such as MAF and INFO values. However, I can't proceed because these columns are missing.
I would like to primarily use the terminal and only turn to R for statistical analysis. Could you guide me on the correct approach to convert the .vcf file to .txt so that I can continue following the tutorial with PLINK?
2. General approach
In addition to addressing the conversion issue, I'm also seeking guidance on the appropriate data sources for genetic analysis. I'm curious about whether OpenGWAS is a reliable source for publication or if there are other recommended sources like dbGaP. Additionally, what is the ideal data format to obtain?
The beginning of the received .vcf file looks like this:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=ES,Number=A,Type=Float,Description="Effect size estimate relative to the alternative allele">
##FORMAT=<ID=SE,Number=A,Type=Float,Description="Standard error of effect size estimate">
##FORMAT=<ID=LP,Number=A,Type=Float,Description="-log10 p-value for effect estimate">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Alternate allele frequency in the association study">
##FORMAT=<ID=SS,Number=A,Type=Float,Description="Sample size used to estimate genetic effect">
##FORMAT=<ID=EZ,Number=A,Type=Float,Description="Z-score provided if it was used to derive the EFFECT and SE fields">
##FORMAT=<ID=SI,Number=A,Type=Float,Description="Accuracy score of summary data imputation">
##FORMAT=<ID=NC,Number=A,Type=Float,Description="Number of cases used to estimate genetic effect">
##FORMAT=<ID=ID,Number=1,Type=String,Description="Study variant identifier">
##META=<ID=TotalVariants,Number=1,Type=Integer,Description="Total number of variants in input">
##META=<ID=VariantsNotRead,Number=1,Type=Integer,Description="Number of variants that could not be read">
##META=<ID=HarmonisedVariants,Number=1,Type=Integer,Description="Total number of harmonised variants">
##META=<ID=VariantsNotHarmonised,Number=1,Type=Integer,Description="Total number of variants that could not be harmonised">
##META=<ID=SwitchedAlleles,Number=1,Type=Integer,Description="Total number of variants strand switched">
##META=<ID=TotalControls,Number=1,Type=Integer,Description="Total number of controls in the association study">
##META=<ID=TotalCases,Number=1,Type=Integer,Description="Total number of cases in the association study">
##META=<ID=StudyType,Number=1,Type=String,Description="Type of GWAS study [Continuous or CaseControl]">
##SAMPLE=<ID=ukb-d-20544_1,TotalVariants=9938650,VariantsNotRead=0,HarmonisedVariants=9938650,VariantsNotHarmonised=0,SwitchedAlleles=209,StudyType=Continuous>
##contig=<ID=1,length=249250621,assembly=HG19/GRCh37>
##contig=<ID=2,length=243199373,assembly=HG19/GRCh37>
...
##contig=<ID=22,length=51304566,assembly=HG19/GRCh37>
##contig=<ID=X,length=155270560,assembly=HG19/GRCh37>
##contig=<ID=Y,length=59373566,assembly=HG19/GRCh37>
##contig=<ID=MT,length=16569,assembly=HG19/GRCh37>
##contig=<ID=GL000207.1,length=4262,assembly=HG19/GRCh37>
##contig=<ID=GL000226.1,length=15008,assembly=HG19/GRCh37>
...
##contig=<ID=GL000192.1,length=547496,assembly=HG19/GRCh37>
##gwas_harmonisation_command=--json /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/ukb-d-import/processed/ukb-d-20544_1/ukb-d-20544_1_data.json --ref /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/QC/genomes/b37/human_g1k_v37.fasta; 1.1.1
##file_date=2019-11-25T15:12:05.258952
##bcftools_annotateVersion=1.9-74-g6af271c+htslib-1.9-64-g226b4a8
##bcftools_annotateCommand=annotate -a /mnt/storage/home/gh13047/mr-eve/vcf-reference-datasets/dbsnp/dbsnp.v153.b37.vcf.gz -c ID -o /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/ukb-d-import/processed/ukb-d-20544_1/ukb-d-20544_1.vcf.gz -O z /mnt/storage/private/mrcieu/research/scratch/IGD/data/dev/ukb-d-import/processed/ukb-d-20544_1/ukb-d-20544_1_data.vcf.gz; Date=Mon Nov 25 15:37:00 2019
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ukb-d-20544_1
1 692794 rs530212009 CA C . PASS AF=0.111693 ES:SE:LP:AF:SS:ID 0.000379398:0.000798869:0.197332:0.111693:117706:1_692794_CA_C
1 693731 rs12238997 A G . PASS AF=0.116734 ES:SE:LP:AF:SS:ID 0.00058531:0.000756397:0.357493:0.116734:117706:rs12238997