I have a data set similar to the one below. The real data set has 89 values in each column. I'm looking at the expression of RNA between two different treatments (treatment $X$ and treatment $Y$).


     GENE       Xtest1  Xtest2  Xtest3  Ytest1  Ytest2  Ytest3
  0  FOXO       34      193     12.0    23      23      1
  1  TP53       67      432     0.4     234     34      243
  2  LRU0046.3  21      543     234.0   545     6       65
  3  MUC2       768     346     12.0    23      3       4
  4  MUC16      100     234     456.0   435     234     243

I'm trying to work out the best statistical test to run in order to generate $P$-values and see if there is a significant difference between [Xtest1, Xtest2, Xtest3] and [Ytest1, Ytest2, Ytest3]. I considered a $t$-test, but the values are not normally distributed. However, when I take the log of each value, the distribution is approximately normal (the left tail is missing).

Can I get away with converting my data to log values (some downstream analysis actually will require me to convert to log values), or should I use a non-parametric test?

Thomas Bilach
Chip
  • I'd suggest a nonparametric test. See here for an extensive conversation on this question: https://stats.stackexchange.com/questions/121852/how-to-choose-between-t-test-or-non-parametric-test-e-g-wilcoxon-in-small-sampl – num_39 Apr 16 '22 at 17:05

1 Answer

A few comments:

  • It's probably fine to log-transform your data, if that helps with the analysis and it's common in your field. The results are relatively easy to interpret since they reference the geometric mean.

  • A more contemporary approach would be to use Gamma regression rather than log-transform your values.

  • Based on your design, it would make sense to take into account that several observations come from the same gene. You would probably want to use a mixed effects model and treat Gene as a random effect. At a minimum, you could include Gene as a fixed effect "block" in a general linear model.
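
To make the geometric-mean interpretation in the first comment concrete, here is a minimal sketch in base R. It pools the sample values from the question by treatment and ignores the gene structure for the moment, so it only illustrates the back-transformation and is not a recommended analysis. The names `geo_mean`, `log_diff`, and `ratio` are made up for this illustration.

```r
# Sample values from the question's table, pooled by treatment
# (gene structure is deliberately ignored in this illustration)
x <- c(34, 193, 12.0,  67, 432, 0.4,  21, 543, 234.0,  768, 346, 12.0,  100, 234, 456.0)
y <- c(23, 23, 1,  234, 34, 243,  545, 6, 65,  23, 3, 4,  435, 234, 243)

# Geometric mean: exponentiate the mean of the logs (values must be > 0)
geo_mean <- function(v) exp(mean(log(v)))

# A difference in means on the log scale ...
log_diff <- mean(log(x)) - mean(log(y))

# ... back-transforms to a ratio of geometric means
ratio <- exp(log_diff)

geo_mean(x)
geo_mean(y)
ratio
```

So a test on log-transformed values is, in effect, a statement about the ratio of geometric means of X and Y rather than about the difference in arithmetic means.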

The following R code uses your sample data. Here, I used a mixed effects model with Gamma regression. There's some code in the beginning to re-arrange the data frame.

    ### Transform data frame ###

    Wide = read.table(header=TRUE, stringsAsFactors=TRUE, text="
     Obs  GENE       Xtest1  Xtest2  Xtest3  Ytest1  Ytest2  Ytest3
     0    FOXO       34      193     12.0    23      23      1
     1    TP53       67      432     0.4     234     34      243
     2    LRU0046.3  21      543     234.0   545     6       65
     3    MUC2       768     346     12.0    23      3       4
     4    MUC16      100     234     456.0   435     234     243
    ")

    library(tidyr)

    # Reshape from wide to long: one row per gene-by-replicate measurement
    Long = gather(Wide, Condition, Value, Xtest1:Ytest3, factor_key=TRUE)

    # Derive the treatment (X or Y) from the replicate column name
    Long$Treatment[substr(Long$Condition, 1, 5)=="Xtest"] = "X"
    Long$Treatment[substr(Long$Condition, 1, 5)=="Ytest"] = "Y"

    Long$Treatment = factor(Long$Treatment)

    Long

    # Check the design: number of observations per treatment and gene
    xtabs( ~ Treatment + GENE, data=Long)

    ### Mixed-effects Gamma regression ###

    library(lme4)

    library(lmerTest)

    # Gene as a random intercept; Gamma() uses its default (inverse) link
    model = glmer(Value ~ Treatment + (1|GENE), data=Long, family=Gamma())

    summary(model)

    # Residual diagnostics
    hist(resid(model))

    plot(predict(model), resid(model))
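
For comparison, the simpler option mentioned in the comments above (log-transform the values and include Gene as a fixed-effect "block" in a general linear model) could be sketched as follows. This is only an illustration under stated assumptions: it uses base R, rebuilds the question's sample data so that it runs on its own, and the names `LongLog`, `LogValue`, and `model_block` are introduced here and are not part of the code above.

```r
# Long-format version of the question's sample data:
# three replicates per gene under each treatment
genes  <- c("FOXO", "TP53", "LRU0046.3", "MUC2", "MUC16")
x_vals <- c(34, 193, 12.0,  67, 432, 0.4,  21, 543, 234.0,  768, 346, 12.0,  100, 234, 456.0)
y_vals <- c(23, 23, 1,  234, 34, 243,  545, 6, 65,  23, 3, 4,  435, 234, 243)

LongLog <- data.frame(
  GENE      = factor(rep(rep(genes, each = 3), times = 2)),
  Treatment = factor(rep(c("X", "Y"), each = 15)),
  Value     = c(x_vals, y_vals)
)

# Log-transform the response (requires strictly positive values)
LongLog$LogValue <- log(LongLog$Value)

# General linear model with Gene as a fixed-effect block
model_block <- lm(LogValue ~ Treatment + GENE, data = LongLog)

anova(model_block)

# Back-transform the treatment coefficient:
# estimated ratio of geometric means, Y relative to X, adjusted for gene
exp(coef(model_block)["TreatmentY"])

# Residual diagnostics, as for the Gamma model
hist(resid(model_block))
plot(fitted(model_block), resid(model_block))
```

The mixed-model analogue on the log scale would replace the `lm` call with `lmer(LogValue ~ Treatment + (1|GENE), data = LongLog)` from lme4. With many genes, as in the real data, treating Gene as a random effect is likely the more natural choice.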

Sal Mangiafico