Output from vcftools missingness

Question

I'm new to filtering VCF data with vcftools.
I performed variant calling on my dataset (chr22, Homo sapiens), and I'd like to remove sites that are missing in more than 5% of individuals.

vcftools --missing-site --vcf updated_ids68.vcf

This gives me a file, out.lmiss, which reports missingness per locus.
My input VCF file had 68 individuals.

CHR     POS     N_DATA  N_GENOTYPE_FILTERED     N_MISS  F_MISS  
22      16848278        106     0       30      0.283019  
22      16848492        68      0       68      1  
22      16849180        69      0       67      0.971014  
22      16849229        68      0       68      1  
22      16849376        133     0       3       0.0225564  
22      16849476        132     0       6       0.0454545  
22      16851734        126     0       36      0.285714  
22      16852588        123     0       13      0.105691

I'm unable to understand the output here. What does the third column, N_DATA, tell me?
Values such as 133 and 126 can't be a number of individuals if my input had only 68 individuals.

VCFtools (0.1.15)
fileformat=VCFv4.1

spvensko · Accepted Answer · 2019-01-29T02:22:37.567

Humans are diploid, so you can expect to see up to 2*N (2*68 = 136) alleles per site. N_DATA is therefore a count of alleles at that site, not a count of individuals.
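
One way to sanity-check this against the table in the question is the short Python sketch below, with the rows copied by hand. It checks that no N_DATA value exceeds 2*68, and that the reported F_MISS equals N_MISS / N_DATA up to rounding. The variable names are my own, and reading N_DATA as the denominator of F_MISS is inferred from these eight rows only, not taken from the vcftools documentation.

```python
# Rows copied from the out.lmiss table in the question:
# (POS, N_DATA, N_MISS, F_MISS)
rows = [
    (16848278, 106, 30, 0.283019),
    (16848492, 68, 68, 1.0),
    (16849180, 69, 67, 0.971014),
    (16849229, 68, 68, 1.0),
    (16849376, 133, 3, 0.0225564),
    (16849476, 132, 6, 0.0454545),
    (16851734, 126, 36, 0.285714),
    (16852588, 123, 13, 0.105691),
]

n_individuals = 68
max_alleles = 2 * n_individuals  # diploid upper bound: 136

# Is every N_DATA at or below the diploid upper bound?
within_bound = all(n_data <= max_alleles for _, n_data, _, _ in rows)

# Does the reported F_MISS match N_MISS / N_DATA?
# The tolerance allows for rounding in the printed values.
ratio_matches = all(
    abs(n_miss / n_data - f_miss) < 1e-5
    for _, n_data, n_miss, f_miss in rows
)

print(within_bound, ratio_matches)
```

Both checks print True for these rows, which suggests that N_MISS is counted within N_DATA rather than in addition to it.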

edited Jan 29 '19 at 02:22

answered Jan 28 '19 at 20:45

Furthermore, it also looks like N_MISS + N_DATA = 2*68 – winni2k Jan 29 '19 at 11:11

My initial answer mentioned that, but that doesn't seem to hold true for all rows. Consider the following rows from their table:
22 16849476 132 0 6 0.0454545
22 16851734 126 0 36 0.285714

The N_DATA and N_MISS instead sum to 138 and 162, respectively.
– spvensko Jan 29 '19 at 14:48
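
For reference, a quick Python sketch that adds N_DATA and N_MISS for each row of the question's table and lists the positions where the sum differs from 2*68. The rows are copied by hand and the names are illustrative only.

```python
# (POS, N_DATA, N_MISS) copied from the out.lmiss table in the question
rows = [
    (16848278, 106, 30),
    (16848492, 68, 68),
    (16849180, 69, 67),
    (16849229, 68, 68),
    (16849376, 133, 3),
    (16849476, 132, 6),
    (16851734, 126, 36),
    (16852588, 123, 13),
]

expected = 2 * 68  # 136 alleles if every individual were diploid

# Keep only the positions whose N_DATA + N_MISS is not 136.
off = {
    pos: n_data + n_miss
    for pos, n_data, n_miss in rows
    if n_data + n_miss != expected
}

print(off)  # {16849476: 138, 16851734: 162}
```

So six of the eight rows shown do sum to 136, and only these two do not.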

Right. That is surprising. What could cause that? – winni2k Jan 29 '19 at 15:10
@spvensko Great catch. My data are structural variants, maybe that is why? Thanks. – Death Metal Jan 29 '19 at 21:50
