
I have population SNP data in VCF format, and I found that some SNPs are highly similar across samples (> 99% identical). For example:

CHROM_POS   s0  s1  s2  s3  s4  s5  s6  s7  s8  s9
chr1_1  A   G   G   A   G   G   A   A   G   A
chr1_2  A   G   G   A   G   G   A   A   G   T
chr1_3  C   C   C   A   C   A   C   C   C   C

The similarity between chr1_1 and chr1_2 is 0.9, because they differ only in sample s9.
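
To make the measure concrete, here is a minimal sketch, assuming "similarity" means the fraction of samples in which two SNPs carry the same allele. The function name `snp_similarity` and the dictionary layout are illustrative only and are not taken from any particular tool.

```python
def snp_similarity(a, b):
    """Return the fraction of samples in which two SNPs carry the same allele."""
    if len(a) != len(b):
        raise ValueError("both SNPs must be genotyped in the same samples")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Alleles for samples s0..s9, copied from the table above.
snps = {
    "chr1_1": "A G G A G G A A G A".split(),
    "chr1_2": "A G G A G G A A G T".split(),
    "chr1_3": "C C C A C A C C C C".split(),
}

print(snp_similarity(snps["chr1_1"], snps["chr1_2"]))  # 0.9
print(snp_similarity(snps["chr1_1"], snps["chr1_3"]))  # 0.1
```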

Is there a good way to remove these similar SNPs before I put the data into another pipeline?

Added: These SNPs will be used for GWAS analysis.

James Hawley
l0o0
  • Do you mean that some SNPs have a MAF of <1%? – Emily_Ensembl Sep 05 '17 at 08:11
  • @Emily_Ensembl MAF is about allele frequency. But I want to remove SNPs that are similar across samples. – l0o0 Sep 05 '17 at 08:16
  • The question is whether removing common SNPs is a good thing. – SmallChess Sep 05 '17 at 08:28
  • @SmallChess, I think similar SNPs may provide duplicate information and cost more resources to compute. Maybe similar SNPs that are far apart should be kept? – l0o0 Sep 05 '17 at 08:34
  • It wasn't clear from your original question what you meant. Now that you've edited it, I can see that you're looking at haplotypic blocks with SNPs in LD. No, don't remove them. There may be causal SNPs within these blocks: you won't be able to tell which SNP is causal, but you can identify that the block is important. – Emily_Ensembl Sep 05 '17 at 08:56
  • @Emily_Ensembl, I am not familiar with haplotypic blocks. Are the SNPs in such a block positionally adjacent? If they are adjacent, I would consider the distance between them, besides their similarity. – l0o0 Sep 05 '17 at 09:09
  • Variants that are adjacent are more likely to have alleles inherited together, because there is less crossing over between them in meiosis. Such pairs of variants are said to be in Linkage Disequilibrium (LD), and blocks of them are called haplotypic blocks. – Emily_Ensembl Sep 05 '17 at 09:46

1 Answer


What you are attempting to do is known as LD-pruning.
As @Emily_Ensembl said, it is not customary to do this for standard association tests: one of the SNPs you remove may be causal, or a better proxy for the causal locus, and would give a (slightly) better association signal than the one you keep. Even for SNPs in perfect LD, pruning is unwise because it complicates interpretation: if SNP chr1_2 had some obvious function, but you removed it and were left only with its proxy chr1_1 (which could be far away), you would have a hard time connecting the dots. Typical GWAS tests run quickly on modern computers and scale linearly with the number of SNPs, so there is no computational reason to prune either.

On the other hand, many fancier procedures would require LD-pruned data, because they either
a) are very computationally expensive, or
b) assume independent SNPs.

I suggest reading PLINK's LD pruning documentation, as it pretty much covers the standard ways to do this.
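
To illustrate the general idea only, here is a toy sketch of greedy r²-based pruning in plain Python. It is not PLINK's algorithm. It assumes genotypes are already coded as biallelic dosages (0/1/2 copies of one allele), and the names `r_squared` and `ld_prune`, the window of 50 SNPs, the threshold of 0.2, and the toy data are all made up for the example.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked here
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def ld_prune(genotypes, window=50, threshold=0.2):
    """Greedily keep SNPs, dropping any SNP whose r^2 with an already-kept SNP
    at most `window` positions upstream exceeds `threshold`.

    `genotypes` is a list of (name, dosages) pairs ordered by position.
    """
    kept = []  # (index, name, dosages) of the SNPs retained so far
    for i, (name, dosages) in enumerate(genotypes):
        linked = any(
            i - j <= window and r_squared(dosages, other) > threshold
            for j, _, other in kept
        )
        if not linked:
            kept.append((i, name, dosages))
    return [name for _, name, _ in kept]


# Hypothetical dosages for three SNPs in ten samples.
toy = [
    ("snp_a", [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]),
    ("snp_b", [0, 1, 1, 0, 1, 1, 0, 0, 1, 1]),  # differs from snp_a in one sample
    ("snp_c", [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]),
]
print(ld_prune(toy))  # ['snp_a', 'snp_c']
```

Note that this uses r² rather than the raw fraction of matching alleles, since r² is the usual measure of LD. For real data, use PLINK or a similar tool rather than a script like this.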

juod