I'm currently working on a logistic regression problem with a case-control response variable and SNP genotype as the regressor. Can I treat the genotype codes 0, 1, 2 as continuous instead of as a factor?

Hongqi

2 Answers

You can, but the two approaches make different implicit assumptions. Treating the genotype as numeric 0/1/2 is essentially the dosage (additive) model: you assume that having two copies of the minor allele (homozygous minor) has twice the effect on the phenotype of having one copy (heterozygous), and that having zero copies (homozygous major) is the baseline with no effect. In a logistic regression, "effect" here means the change in the log-odds of being a case.

Treating the genotype as a factor does not make this dosage assumption. Each level "0", "1", "2" can have its own effect, and the effects do not have to fall on a line as they do in the dosage model. For example, you could have "0" -> strong effect, "1" -> weak effect, "2" -> strong effect. So this approach lets you discover non-dosage effects.
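To make the difference concrete, here is a minimal Python sketch (not from the original answer). It uses made-up case/control counts and a hypothetical helper, `log_odds_ratios`, to compute the per-genotype log odds ratios that a factor-coded logistic regression would estimate, and then measures how far they are from the straight line that the dosage model assumes.

```python
import math

# Hypothetical case/control counts per genotype (number of minor alleles).
# These numbers are invented purely for illustration.
cases = {0: 100, 1: 150, 2: 90}
controls = {0: 200, 1: 150, 2: 45}


def log_odds_ratios(cases, controls, reference=0):
    """Log odds ratio of each genotype versus the reference genotype.

    With genotype entered as a factor (and no other covariates), the fitted
    logistic regression coefficients equal these empirical log odds ratios.
    """
    ref = math.log(cases[reference] / controls[reference])
    return {g: math.log(cases[g] / controls[g]) - ref for g in cases}


def departure_from_additivity(lor):
    """Zero when the effect of genotype 2 is exactly twice that of genotype 1."""
    return lor[2] - 2 * lor[1]


lor = log_odds_ratios(cases, controls)
print({g: round(v, 3) for g, v in lor.items()})
print(round(departure_from_additivity(lor), 3))
```

With these particular counts the odds double with each extra minor allele, so the departure is essentially zero and the dosage model would fit well. Counts that follow, say, a recessive pattern would give a clearly non-zero departure, which only the factor coding can capture.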

purple51
  • Thank you, that's been very helpful. Can you also recommend a textbook with the above information, perhaps with examples in R? – Hongqi Aug 05 '14 at 08:08
  • The book "Applied Statistical Genetics with R" by Andrea Foulkes could be useful, but I don't have a copy in front of me to check whether it discusses dosage models and coding. – purple51 Aug 05 '14 at 10:27
  • Yes, it does (when describing the Cochran-Armitage trend test, pp. 42-43). – chl Aug 06 '14 at 07:02

Adding to what purple51 already said: SNP data can, in general, be coded in several ways. The usual numeric coding of a SNP is the number of variant (minor) alleles, {0, 1, 2}, but this is not always the case.

As for coding for analysis purposes, there are several possibilities. With factor coding you get the genotype model, which has three factor levels: 0, 1, and 2. The genotype model can be "reduced" to (A) a recessive model, with the coding {0, 1} -> 0 and {2} -> 1, or (B) a dominant model, with the coding {0} -> 0 and {1, 2} -> 1. Alternatively, you can use the additive (dosage) model, where you essentially use the number of variant alleles as a "continuous" variable in the regression. Each of these tests a different hypothesis about how the SNP might be associated with your outcome of interest.

Note that this is not an exhaustive list of coding possibilities; only the most commonly used ones are mentioned.
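The codings described above can be sketched in a few lines of Python. The function name `recode` and the model labels are hypothetical, chosen only for this illustration; the input is assumed to be the number of variant alleles per individual.

```python
def recode(genotypes, model):
    """Recode variant-allele counts (0, 1, 2) under a given genetic model."""
    if any(g not in (0, 1, 2) for g in genotypes):
        raise ValueError("genotypes must be coded 0, 1, or 2")
    if model == "additive":
        # Dosage model: allele count used as a numeric predictor.
        return [float(g) for g in genotypes]
    if model == "dominant":
        # {0} -> 0 and {1, 2} -> 1
        return [int(g >= 1) for g in genotypes]
    if model == "recessive":
        # {0, 1} -> 0 and {2} -> 1
        return [int(g == 2) for g in genotypes]
    if model == "genotype":
        # Factor coding: two dummy columns, with genotype 0 as reference.
        return [(int(g == 1), int(g == 2)) for g in genotypes]
    raise ValueError(f"unknown model: {model}")


snp = [0, 1, 2, 1, 0]
for model in ("additive", "dominant", "recessive", "genotype"):
    print(model, recode(snp, model))
```

Each recoded variable (or pair of dummy columns, for the genotype model) would then be used as the predictor in the logistic regression, so the choice of coding determines which hypothesis is tested.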

  • Thank you so much! Can you also recommend a textbook with such information, perhaps with R applications? – Hongqi Aug 05 '14 at 08:07