There are two potential sources of bias in this design.
- We cannot distinguish correlation from causation.
Imagine two cases. In the first, disease progression induces an immune response, so later stages are associated with higher gene expression levels. In the second, the disease is caused by overexpression of a gene, so later stages are again associated with higher expression. The observed data look the same in both cases.
This is typical of observational studies, but I want to stress that special care should be taken when interpreting the results.
- If we do not follow the same individuals over time, we cannot distinguish correlation from avoidance.
Let's say the disease is lethal in certain cases, so survival is negatively correlated with disease stage. Now imagine there is a gene that causes severe symptoms when highly expressed. At the later stages you will only observe those patients in whom the gene was not highly expressed. From that you would conclude that gene expression decreases with disease progression. In reality this gene is very important and causal; you just do not have any surviving patients in whom to see this.
This is similar to Wald's study of aircraft (survivorship bias).
Researchers from the Center for Naval Analyses had conducted a study
of the damage done to aircraft that had returned from missions, and
had recommended that armor be added to the areas that showed the most
damage.
Wald proposed that the Navy instead reinforce the areas where the
returning aircraft were unscathed, since those were the areas that, if
hit, would cause the plane to be lost.
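The same effect can be reproduced in a small simulation. This is only a sketch with invented numbers: expression is fixed per patient (so it never truly changes over time), and the survival probabilities `0.9` and `0.2` are hypothetical. Comparing all patients at the early stage with only the survivors at the late stage still shows an apparent drop in expression.

```python
import random
from statistics import mean

def simulate(n_patients=20000, seed=1):
    """Toy cohort in which high expression makes death before the late stage more likely.

    All numbers are invented for illustration.
    """
    rng = random.Random(seed)
    early, late_survivors = [], []
    for _ in range(n_patients):
        expr = rng.gauss(0.0, 1.0)  # fixed per patient: no true change over time
        early.append(expr)          # everyone is observable at the early stage
        # Hypothetical lethality: survival to the late stage drops when expression is high.
        p_survive = 0.9 if expr < 0 else 0.2
        if rng.random() < p_survive:
            late_survivors.append(expr)
    return mean(early), mean(late_survivors)

early_mean, late_mean = simulate()
print(f"mean expression, early stage:          {early_mean:+.2f}")
print(f"mean expression, late-stage survivors: {late_mean:+.2f}")
```

The late-stage mean comes out clearly lower, even though no individual's expression decreased. Following the same individuals would reveal that the high-expression patients dropped out rather than changed.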
I think the second point is crucial: it can and will lead to false conclusions. I do not immediately see how to avoid this bias if you study different individuals.
Therefore I suggest that the same individuals be followed for a long time.
There are different approaches you can use later. For example, you can use a two-stage procedure, optionally extended with further models:
- Identify genes which are differentially expressed between healthy (H) and sick (A, B or C).
- Build a linear model of disease stage, stage ~ gene1 + gene2 + ..., using the genes identified in the first step.
- Similarly, build a linear model of survival as a function of gene expression.
- You may also be able to use logistic regression to infer the probability of a transition from stage B to stage C.
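The first two steps can be sketched as follows. This is a minimal illustration on simulated data, not a recommended pipeline: the numeric coding of stages (H=0, A=1, B=2, C=3), the gene names, and the cutoff of 4.0 on the Welch t statistic are all assumptions made for the example. A real analysis would use a dedicated differential expression method with multiple-testing correction, and treating an ordinal stage as a number is itself a modelling choice.

```python
import random
from statistics import mean, variance

rng = random.Random(0)
STAGE = {"H": 0, "A": 1, "B": 2, "C": 3}  # assumed numeric coding of the stages

# Simulated data (hypothetical): gene1 rises with stage, gene2 is pure noise.
samples = []
for label, stage in STAGE.items():
    for _ in range(50):
        samples.append({
            "stage": stage,
            "gene1": 1.0 * stage + rng.gauss(0, 1),
            "gene2": rng.gauss(0, 1),
        })

def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    return (mean(x) - mean(y)) / (variance(x) / len(x) + variance(y) / len(y)) ** 0.5

# Step 1: keep genes that differ between healthy (H) and sick (A, B or C).
healthy = [s for s in samples if s["stage"] == 0]
sick = [s for s in samples if s["stage"] > 0]
selected = [
    g for g in ("gene1", "gene2")
    if abs(welch_t([s[g] for s in sick], [s[g] for s in healthy])) > 4.0  # crude cutoff
]

def fit_ols(rows, response, predictors):
    """Ordinary least squares via the normal equations (Gauss-Jordan elimination)."""
    X = [[1.0] + [r[p] for p in predictors] for r in rows]
    y = [r[response] for r in rows]
    k = len(predictors) + 1
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Step 2: stage ~ selected genes
coefs = fit_ols(samples, "stage", selected)
print("selected genes:", selected)
print("coefficients (intercept first):", [round(c, 3) for c in coefs])
```

Note that this sketch inherits both biases discussed above: a positive coefficient says nothing about the direction of causation, and if the sick samples are survivors only, the selection in step 1 is already distorted.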
This is not the only possible approach, of course. As you suggested, a Markov model is also applicable.
When following the same individuals, it is possible to use a continuous-time Markov model for such a study. In this case, discrete or continuous measurements are recorded at certain time points, and the model parameters are inferred by maximum likelihood.
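To make the idea concrete, here is a rough sketch of the likelihood for a progressive three-state model A -> B -> C, where the transition probabilities over an interval t are P(t) = exp(Qt) for an intensity matrix Q. Everything specific here is an assumption for illustration: the state coding, the two rate parameters `q_ab` and `q_bc`, the made-up observations, and the grid search that stands in for a real optimiser. It does not use or reproduce the msm package.

```python
import math

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]

def mat_exp(M, squarings=6, terms=10):
    """exp(M) by scaling and squaring with a truncated Taylor series."""
    n = len(M)
    scaled = [[v / 2 ** squarings for v in row] for row in M]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[v / k for v in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def transition_probs(q_ab, q_bc, t):
    """P(t) = exp(Q t) for a progressive model A -> B -> C with C absorbing."""
    Q = [[-q_ab, q_ab, 0.0],
         [0.0, -q_bc, q_bc],
         [0.0, 0.0, 0.0]]
    return mat_exp([[q * t for q in row] for row in Q])

def log_likelihood(q_ab, q_bc, observations):
    """observations: (state_at_visit, state_at_next_visit, elapsed_time) triples."""
    total = 0.0
    for start, end, dt in observations:
        total += math.log(transition_probs(q_ab, q_bc, dt)[start][end])
    return total

# Hypothetical panel data: states coded A=0, B=1, C=2; elapsed times in years.
obs = [(0, 0, 1.0), (0, 1, 1.0), (1, 1, 0.5), (1, 2, 2.0), (0, 2, 3.0), (0, 1, 2.0)]

# Crude grid search over the two rates, standing in for a proper optimiser.
grid = [0.1 * i for i in range(1, 21)]
best = max(((a, b) for a in grid for b in grid),
           key=lambda q: log_likelihood(q[0], q[1], obs))
print("approximate MLE (q_ab, q_bc):", best)
```

The key point is that each pair of consecutive visits from the same individual contributes one term to the likelihood, which is why this approach needs longitudinal data rather than independent samples per stage.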
I'm not an expert in this field, but I think this paper describing the msm package for R will be useful. It also supports hidden Markov models.
~0+f (where f is the factor with the groups), which is the same as ~A+B+C (that's why the columns are later renamed to RNA1, RNA2, RNA3 in the tutorial; if it had one Disease/Group column, how could it rename three columns?) – llrs May 17 '17 at 15:27

A, B, C are not all observed in a given sample - some individuals never make it to C. Whether C is binary or a quantitative trait, if it never happens for some individuals, then no data is observed that would enter the model. These samples' estimates would be functions of their observations of A, B. If you wish to more strongly impact the model estimation process, you can look at something like a general likelihood, where you add priors to model coefficients, as well as encode other prior information. – learner May 18 '17 at 11:56