3

I'm new to filtering VCF data with vcftools.
I performed variant calling on my dataset (chromosome 22, Homo sapiens). I'd like to remove sites that are missing in more than 5% of individuals.

vcftools --missing-site --vcf updated_ids68.vcf

This gives me a file, out.lmiss, which reports missingness per locus.
My input VCF file had 68 individuals.

CHR     POS     N_DATA  N_GENOTYPE_FILTERED     N_MISS  F_MISS  
22      16848278        106     0       30      0.283019  
22      16848492        68      0       68      1  
22      16849180        69      0       67      0.971014  
22      16849229        68      0       68      1  
22      16849376        133     0       3       0.0225564  
22      16849476        132     0       6       0.0454545  
22      16851734        126     0       36      0.285714  
22      16852588        123     0       13      0.105691
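
For reference, the 5% cutoff can be applied to the F_MISS column directly. The following is a minimal Python sketch, not part of vcftools: the rows are copied from the excerpt above, and `MAX_F_MISS`, `parse_lmiss` and `sites_to_keep` are names made up for illustration. It assumes F_MISS is the right quantity to threshold, which may not be exactly "fraction of individuals" if the counts are per allele.

```python
# Rows copied from the out.lmiss excerpt above (whitespace-separated).
LMISS_TEXT = """\
CHR POS N_DATA N_GENOTYPE_FILTERED N_MISS F_MISS
22 16848278 106 0 30 0.283019
22 16848492 68 0 68 1
22 16849180 69 0 67 0.971014
22 16849229 68 0 68 1
22 16849376 133 0 3 0.0225564
22 16849476 132 0 6 0.0454545
22 16851734 126 0 36 0.285714
22 16852588 123 0 13 0.105691
"""

# Assumption: "missing in more than 5%" is read as F_MISS > 0.05.
MAX_F_MISS = 0.05


def parse_lmiss(text):
    """Turn the table text into a list of dicts keyed by column name."""
    lines = text.strip().splitlines()
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def sites_to_keep(rows, max_f_miss=MAX_F_MISS):
    """Return (CHR, POS) for sites whose F_MISS does not exceed the cutoff."""
    return [
        (row["CHR"], int(row["POS"]))
        for row in rows
        if float(row["F_MISS"]) <= max_f_miss
    ]


kept = sites_to_keep(parse_lmiss(LMISS_TEXT))
for chrom, pos in kept:
    print(chrom, pos)
```

On these eight rows only the sites at 16849376 and 16849476 pass the cutoff.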

I'm unable to understand this output. What does the third column, N_DATA, represent?
Values such as 133 and 126 can't be the number of individuals, since my input had only 68.

VCFtools (0.1.15)
fileformat=VCFv4.1

Daniel Standage
Death Metal

1 Answer

4

Humans are diploid, so with N = 68 individuals you can expect to see up to 2*N (2*68 = 136) alleles per site. N_DATA is therefore a count of allele observations at that site, not a count of individuals.
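
A quick arithmetic check on the rows posted in the question may help here. This is only a sketch in Python, with the (N_DATA, N_MISS, F_MISS) values copied from the table above and variable names chosen for illustration. It tests whether F_MISS equals N_MISS / N_DATA in each row, and also computes N_DATA + N_MISS for comparison against 2*68 = 136.

```python
# (N_DATA, N_MISS, F_MISS) copied from the out.lmiss rows in the question.
rows = [
    (106, 30, 0.283019),
    (68, 68, 1.0),
    (69, 67, 0.971014),
    (68, 68, 1.0),
    (133, 3, 0.0225564),
    (132, 6, 0.0454545),
    (126, 36, 0.285714),
    (123, 13, 0.105691),
]

N_INDIVIDUALS = 68
MAX_ALLELES = 2 * N_INDIVIDUALS  # 136 for a diploid sample

# Does the reported F_MISS match N_MISS / N_DATA (to printed precision)?
ratios_match = [abs(n_miss / n_data - f_miss) < 1e-6
                for n_data, n_miss, f_miss in rows]

# What do N_DATA and N_MISS add up to in each row?
sums = [n_data + n_miss for n_data, n_miss, _ in rows]

print("F_MISS == N_MISS / N_DATA in every row:", all(ratios_match))
print("N_DATA never exceeds 2*N:", all(r[0] <= MAX_ALLELES for r in rows))
print("N_DATA + N_MISS per row:", sums)
```

In these rows the ratio matches every time, which would suggest that N_DATA is the denominator used for F_MISS. The sums come to 136 in six of the eight rows; the other two do not, and this check does not explain why.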

spvensko
  • 1
    Furthermore, it also looks like N_MISS + N_DATA = 2*68 – winni2k Jan 29 '19 at 11:11
  • 2
    My initial answer mentioned that, but that doesn't seem to hold true for all rows. Consider the following rows from their table:

    22 16849476 132 0 6 0.0454545
    22 16851734 126 0 36 0.285714

    The N_DATA and N_MISS instead sum to 138 and 162, respectively.

    – spvensko Jan 29 '19 at 14:48
  • 1
    Right. That is surprising. What could cause that? – winni2k Jan 29 '19 at 15:10
  • @spvensko Great catch. My data are structural variants, maybe that is why? Thanks. – Death Metal Jan 29 '19 at 21:50