9

A popular framework to analyze differences between groups, either experiments or diseases, in transcriptomics is using linear models (limma is a popular choice).

For instance we have a disease D with three stages as defined by clinicians, A, B and C. 10 samples each stage and the healthy H to compare with is RNA-sequenced. A typical linear model would be to observe the three stages~A+B+C independently. The data of each stage is not from the same person. (but for the question assume it isn't)

My understanding is that such a model would not take into account that stage C appears only on 30% of patients in stage B. And that a healthy patient upon external factors can jump to stage B.

If we want to find the role of a gene in the disease we should include somehow this information in the model. Which makes me think about mixing linear models and hidden Markov chains.

How can such a disease be described in terms of linear models with such data and information?

llrs
  • 4,693
  • 1
  • 18
  • 42
  • Of course, your model is not taking into account that observations (i.e 30% of patients from B evolve to C). Here, limma would look for genes with most significant differences among states based on F-statistic. If you are more interested in disease progression, you should read more about time-course patterns (states follow a progression, so positions are not transposable). – Curro Campos May 17 '17 at 09:08
  • But time-course patterns wouldn't handle the jump from healthy to B. A quick look for disease time course patterns show the top hit with an article from 2001. Could you point me to a review or an introductory title – llrs May 17 '17 at 09:24
  • 1
    Could you clarify (probably in the question) if stages A, B & C come from the same individuals? – Iakov Davydov May 17 '17 at 12:30
  • @IakovDavydov Done, the stages are from the disease, the data might come from the same individuals or not. But we can assume the data from each stage doesn't come from the same individuals – llrs May 17 '17 at 13:09
  • Why do you think it's useful to include the progression information in the model ("we should include somehow this information in the model"). You're only able to discern when/if the gene is changing and no statistical model will change that. – Devon Ryan May 17 '17 at 13:18
  • What about saying that if you progressed to B from C, then you are positive for both those variables, rather than that you just have B? – Ian Sudbery May 17 '17 at 13:24
  • @DevonRyan My rationale behind is the following: imagine that a gene between B and C is found to change the expression. Is the change in the gene expression a consequence or a reason of changing the stage? – llrs May 17 '17 at 13:57
  • @IanSudbery I thought about that possibility but clinicians don't think that way. But I would actually like to test this model with the dataset I have at hands. – llrs May 17 '17 at 13:58
  • @Llopis You're not going to get causality from this sort of data. – Devon Ryan May 17 '17 at 14:11
  • 3
    I don't understand your description of the linear model as ~A+B+C. If these are indeed stages of a disease - then the model would rightly just be ~Disease. And the levels of the data column or vector would be the factors (H,A,B,C) - where H is reset to the reference level. – Stephen Henderson May 17 '17 at 14:29
  • @StephenHenderson Doing that way the model I couldn't compare between stages. – llrs May 17 '17 at 14:46
  • @Llopis I believe you are confused and have that the wrong way round. I would refer you to section 9.3 page 42 of the limma manual on dealing with several groups. Note the actual congrats of interest are specified separately.. https://www.bioconductor.org/packages/devel/bioc/vignettes/limma/inst/doc/usersguide.pdf – Stephen Henderson May 17 '17 at 14:59
  • 2
    @StephenHenderson AFAIK, section 9.3 use ~0+f (where f is the factor with the groups) which is the same as ~A+B+C (that's why later the columns are renamed to RNA1, RNA2, RNA3 in the tutorial, if It had one Disease/Group column how could it rename three columns?) – llrs May 17 '17 at 15:27
  • @Llopis that's the design matrix - 3 columns for 3 groups in the single covariate (f). ~A+B+C is like a model with 3 covariates Disease+Sex+Age... see for instance section 9.42 on the next page 43. – Stephen Henderson May 17 '17 at 18:29
  • @Stephen The same design matrix can be build of different ways. I think we are talking about the same model but we build the design matrix differently. But that would be another question IMHO – llrs May 18 '17 at 06:45
  • Like others, I am confused by your model. You say A, B, C are not all observed in a given sample - some individuals never make it to C. Whether C is binary or a quantitative trait, if it never happens for some individuals, then no data is observed that would enter the model. These sample’s estimates would be functions of their observations of A, B. If you wish to more strongly impact the model estimation process, you can look at like a general likelihood, where you add priors to model coefficients, as well as encode other prior information. – learner May 18 '17 at 11:56

1 Answers1

4

There are two potential sources of bias in this design.

  1. We cannot distinguish correlation from causation.

Imagine two cases. In the first, the disease progression is inducing immune response. Later stages will be associated with the higher gene expression levels. In the second scenario, the disease is caused by overexpression of a gene. Later stages will be also associated with the higher expression.

This is typical for observational studies. But I just want to mention that a special care should be taken during interpretation of the results.

  1. If we are not following our individuals, we cannot distinguish correlation and avoidance.

Let's say the disease is lethal in certain cases; the survival is negatively correlated with the disease stage. Now imagine there is a gene, which causes severe symptoms when highly expressed. On the later stages you will only observe those patients in which the gene was not highly expressed. From that you would conclude that gene expression is decreasing with the disease progression. In reality this gene is very important and causal, you just do not have patients which are alive to see this.

This is similar to Wald's studies of aircrafts (Survival bias).

Researchers from the Center for Naval Analyses had conducted a study of the damage done to aircraft that had returned from missions, and had recommended that armor be added to the areas that showed the most damage. Wald proposed that the Navy instead reinforce the areas where the returning aircraft were unscathed, since those were the areas that, if hit, would cause the plane to be lost.

I think the second point is crucial, and can and will lead to false conclusions. I do not see immediately how to avoid this bias if you study different individuals.

Therefore I suggest that the same individuals are followed for a long time.

There are different approaches you can use later. For example you can have two-stage procedure:

  1. Identify genes which are differentially expressed between healthy (H) and sick (A, B or C).
  2. Build a linear model of disease stage stage ~ gene1 + gene2 + ..., using genes identified at step 1.
  3. Similarly build a linear model of survival as a function gene expression.
  4. It is possible you can use logistic regression to infer probability of transition from stage B to stage C.

This is not the only possible approach of course. As you suggested, Markov model is also applicable.

While following the same individuals it is possible to use continuous-time Markov model for such a study. In this case discrete or continuous measurements are recorded at certain time points, and the model parameters are inferred using maximum likelihood.

I'm not an expert in this field, but I think this paper describing the msm package for R will be useful. It also supports hidden Markov models.

Iakov Davydov
  • 2,695
  • 1
  • 13
  • 34
  • You explained better than myself why this model is not good to find causation :). However in your proposed solutions I have two problems. 1) In certain diseases is not easy to follow the same individuals for a long time (for instance you might have to operate to extract the data, so he/she can't be operated every year). 2) Your two models +survival analysis doesn't take into account that stage C appears on only 30% of patients in stage B. – llrs May 18 '17 at 06:54
  • I agree that it is not always possible, although I do not immediately see a way how to overcome this bias. 2) I expanded question on this (see point #4 and msm).
  • – Iakov Davydov May 19 '17 at 11:59
  • Ok, I guess there isn't a way to mix the data in linear models :( – llrs May 19 '17 at 12:28