What will be an appropriate mathematical distribution for SNP data?

Question

I found that several papers describe SNPs as following a binomial distribution with the probability of "success" equal to the minor allele frequency.

However, in my experiments, when I generate a SNP array following this distribution, the simulation results behave very differently from those for a SNP array generated by a random mating procedure. I feel that some information is missing from the binomial description.

What would be a better mathematical description of SNP array data? Besides a direct answer, any thoughts or suggestions of papers are also highly welcome.

EDIT:

Thanks for these suggestions. After reading the relevant material, I think what I observe relates to Hardy-Weinberg equilibrium being affected by evolutionary influences such as mutation and migration. I will keep looking and list anything interesting I find here. Please feel free to suggest some mathematical descriptions.

Thanks also for the suggested tools. They look very helpful, but I really hope to understand some of the mathematics behind them.

Are you familiar with linkage disequilibrium and Hardy-Weinberg equilibrium? Looking up review papers on those might give more insights. — gringer, Sep 01 '17 at 05:43
Re the edit: given the very low mutation rates in eukaryotes, mutation effect on HWE in SNPs should be negligible - or at least, much lower than the effect of genotyping errors. — juod, Sep 02 '17 at 09:00

score 6 · Answer 1 · answered Sep 01 '17 at 10:03

Simulating genotypes with realistic correlation structures is indeed not so simple, and there are quite a few papers dedicated entirely to that (e.g. https://bmcgenet.biomedcentral.com/articles/10.1186/s12863-015-0173-4). Also, DEPICT (https://data.broadinstitute.org/mpg/depict/index.html) comes with a number of simulated GWASs to generate the nulls, so that's a simple way to obtain some ready-made fake data.

On the other hand, independent SNPs should behave as draws from $Binomial(2, MAF)$. Note that the standard QC procedure of filtering out genotypes that deviate from Hardy-Weinberg equilibrium is just a goodness-of-fit test against $Binomial(2, \hat{MAF})$. If you employ this filter, some LD-pruning, and still get different results, consider posting more details about the discrepancies you see - that would indeed be unexpected.
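To make the $Binomial(2, MAF)$ claim concrete, here is a minimal sketch using only the Python standard library. It draws genotypes for a single independent SNP and computes the Pearson chi-square statistic against the Hardy-Weinberg expected counts. The function names, the MAF of 0.3, the sample size, and the seed are illustrative assumptions, not taken from any particular QC tool.

```python
import random

def simulate_genotypes(maf, n_individuals, rng):
    """Draw minor-allele counts (0, 1 or 2) as Binomial(2, maf).

    Each genotype is the sum of two independent Bernoulli(maf) alleles.
    """
    return [
        (rng.random() < maf) + (rng.random() < maf)
        for _ in range(n_individuals)
    ]

def hwe_chi_square(genotypes):
    """Pearson chi-square statistic against Hardy-Weinberg expected counts."""
    n = len(genotypes)
    observed = [genotypes.count(k) for k in (0, 1, 2)]
    p_hat = sum(genotypes) / (2 * n)  # estimated minor allele frequency
    q_hat = 1.0 - p_hat
    expected = [n * q_hat ** 2, 2 * n * p_hat * q_hat, n * p_hat ** 2]
    # Skip empty expected cells (monomorphic SNP) to avoid division by zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

# Hypothetical parameters chosen only for illustration.
rng = random.Random(0)
genotypes = simulate_genotypes(maf=0.3, n_individuals=5000, rng=rng)
chi2 = hwe_chi_square(genotypes)
print(f"estimated MAF: {sum(genotypes) / (2 * len(genotypes)):.3f}")
print(f"HWE chi-square statistic: {chi2:.3f}")
```

With one estimated parameter, the statistic is compared against a chi-square distribution with 1 degree of freedom (critical value of about 3.84 at the 5% level). Independent binomial draws like these will usually pass, while data with population structure or genotyping errors tend to show inflated values.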

Thanks. I agree that independent SNPs behave as draws from a binomial distribution. I think the discrepancies I observe are mainly at the population level. — Haohan Wang, Sep 01 '17 at 15:11
