Should I use a two-tailed t-test to generate p-values?

Question

I have a data set similar to the one below. The real data set has 89 values in each column. I'm looking at the expression of RNA between two different treatments (treatment $X$ and treatment $Y$).


     GENE       Xtest1  Xtest2  Xtest3  Ytest1  Ytest2  Ytest3
  0  FOXO       34      193     12.0    23      23      1
  1  TP53       67      432     0.4     234     34      243
  2  LRU0046.3  21      543     234.0   545     6       65
  3  MUC2       768     346     12.0    23      3       4
  4  MUC16      100     234     456.0   435     234     243

I'm trying to work out the best statistical test to run in order to generate $P$-values and see if there is a significant difference between [Xtest1, Xtest2, Xtest3] and [Ytest1, Ytest2, Ytest3]. I considered a $t$-test, but the values are not normally distributed. However, when I take the log of each value, the distribution is almost normal (the left tail is missing).

Can I get away with converting my data to log values (some downstream analysis actually will require me to convert to log values), or should I use a non-parametric test?

I'd suggest a nonparametric test. See here for an extensive conversation on this question: https://stats.stackexchange.com/questions/121852/how-to-choose-between-t-test-or-non-parametric-test-e-g-wilcoxon-in-small-sampl — num_39, Apr 16 '22 at 17:05

score 5 · Accepted Answer · answered Apr 17 '22 at 15:07

A few comments:

It's probably fine to log-transform your data, if that helps with the analysis and it's common in your field. The results are relatively easy to interpret since they reference the geometric mean.
A more contemporary approach would be to use Gamma regression rather than log-transform your values.
Given your design, it would make sense to account for the fact that several observations come from the same gene. You would probably want to use a mixed-effects model and treat GENE as a random effect. At a minimum, you could include GENE as a fixed-effect "block" in a general linear model.

The following R code uses your sample data. Here, I used a mixed effects model with Gamma regression. There's some code in the beginning to re-arrange the data frame.

### Transform data frame ###
Wide = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Obs GENE       Xtest1  Xtest2  Xtest3  Ytest1  Ytest2  Ytest3
0  FOXO       34      193     12.0    23      23      1
1  TP53       67      432     0.4     234     34      243
2  LRU0046.3  21      543     234.0   545     6       65
3  MUC2       768     346     12.0    23      3       4
4  MUC16      100     234     456.0   435     234     243
")
library(tidyr)
Long =  gather(Wide, Condition, Value, Xtest1:Ytest3, factor_key=TRUE)
Long$Treatment[substr(Long$Condition, 1, 5)=="Xtest"] = "X"
Long$Treatment[substr(Long$Condition, 1, 5)=="Ytest"] = "Y"
Long$Treatment = factor(Long$Treatment)
Long
xtabs( ~ Treatment + GENE, data=Long)

### Mixed-effects Gamma regression ###
library(lme4)
library(lmerTest)
model = glmer(Value ~ Treatment + (1|GENE), data=Long, family=Gamma())
summary(model)
hist(resid(model))
plot(predict(model), resid(model))
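
For comparison, here is a minimal sketch of the simpler alternative mentioned in the comments above: log-transform the values and include GENE as a fixed-effect block in a general linear model. It uses only base R and the same sample data. The object names (`Stacked`, `Long2`, `model.lm`) are my own choices for illustration, and this is one possible way to set it up rather than the definitive analysis.

```r
### Log-transformed values with GENE as a fixed-effect block ###
Wide = read.table(header=TRUE, stringsAsFactors=TRUE, text="
Obs GENE       Xtest1  Xtest2  Xtest3  Ytest1  Ytest2  Ytest3
0  FOXO       34      193     12.0    23      23      1
1  TP53       67      432     0.4     234     34      243
2  LRU0046.3  21      543     234.0   545     6       65
3  MUC2       768     346     12.0    23      3       4
4  MUC16      100     234     456.0   435     234     243
")

# Stack the six measurement columns into one long column (base R only)
Stacked = stack(Wide[, c("Xtest1", "Xtest2", "Xtest3",
                         "Ytest1", "Ytest2", "Ytest3")])

# stack() goes column by column, so the gene labels repeat once per column
Long2 = data.frame(
  GENE      = rep(Wide$GENE, times=6),
  Treatment = factor(substr(as.character(Stacked$ind), 1, 1)),
  Value     = Stacked$values
)

# General linear model on the log scale, with GENE as a block
model.lm = lm(log(Value) ~ Treatment + GENE, data=Long2)
anova(model.lm)

# Back-transform the treatment effect: ratio of geometric means (Y / X)
exp(coef(model.lm)["TreatmentY"])
exp(confint(model.lm)["TreatmentY", ])

# Check the residuals, as with the Gamma model
hist(resid(model.lm))
plot(fitted(model.lm), resid(model.lm))
```

Because the model is fit on the log scale, the exponentiated treatment coefficient is read as a ratio of geometric means, which ties back to the first comment above. The log transform requires all values to be strictly positive, which holds for the sample data but is an assumption to check on the full data set.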
