8

I'm trying to understand the magnitude of batch effects in my RNA-seq samples, and I was wondering which expression units are most suitable for drawing a PCA. I'm considering counts or TPM, but things like rlog or vst could work too.

Additionally, I'm wondering whether either of these units should be log-transformed first, to avoid high-abundance transcripts driving the PCA.

mgalardini

2 Answers

5

tl;dr: log-transform counts and TPMs, but rlog/vst are preferred

TPM should be log-transformed to get more useful results. If you're already using DESeq2 (given the reference to rlog and vst, this seems likely), then go ahead and use rlog or vst. That will give you more reasonable results than raw counts. If you're stuck with counts for some reason, then first use normalized counts so they're at least a bit more comparable, and then log-transform them so your highly expressed genes aren't driving everything.
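In practice the transform is just an elementwise log with a pseudocount. A minimal pure-Python sketch (the function name and the toy TPM matrix are invented for illustration; in R you would normally reach for DESeq2's vst/rlog instead):

```python
import math

def log_transform(tpm_matrix, pseudocount=1.0):
    """Apply log2(x + pseudocount) to a genes-x-samples TPM matrix.

    The pseudocount keeps zeros finite, and the log damps the
    influence of highly expressed genes on a downstream PCA.
    """
    return [[math.log2(x + pseudocount) for x in row] for row in tpm_matrix]

# Toy example: one high-abundance gene, one low-abundance gene.
tpm = [
    [10000.0, 12000.0],  # high-abundance gene
    [3.0, 5.0],          # low-abundance gene
]
logged = log_transform(tpm)
```

After the transform, the high-abundance gene no longer dominates the total variance, so the PCA reflects patterns across all genes rather than just the most expressed ones.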

Edit: As an aside, if you know the source of the batch effect (e.g., library prep date), it's sometimes convenient to include it in your model. You can then assess which genes actually change due to that batch, which is sometimes handy to know (e.g., which genes might be more/less prone to degradation).

Devon Ryan
4

PCA works best when the input data is approximately normally distributed on each dimension. It would be a good idea to do some initial data quality checks to verify that this is the case (and transform the data appropriately if not), or at least verify that the data is approximately normally distributed in the aggregate.
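As a rough illustration of such a check, one can compare the skewness of raw versus log-transformed values: near-normal data has skewness close to zero, while raw counts are typically strongly right-skewed. A pure-Python sketch (the toy count vector is invented for illustration):

```python
import math

def skewness(values):
    """Population skewness: a quick measure of how far a distribution
    departs from the symmetry expected of roughly normal data."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    sd = math.sqrt(var)
    if sd == 0:
        return 0.0
    return sum(((x - mean) / sd) ** 3 for x in values) / n

# Raw counts are typically right-skewed (positive skewness);
# log-transforming pulls them back towards symmetry.
raw = [1, 2, 3, 4, 5, 2000]
logged = [math.log2(x + 1) for x in raw]
```

If the transformed data still shows large skewness on many genes, that is a hint that a different transform (e.g. vst/rlog) is worth trying before the PCA.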

For Illumina RNA-seq data, what worked best for me (i.e. produced the most normal-looking data) were the following steps:

  1. Removing genes that had low raw counts in all samples
  2. Using DESeq's variance-stabilizing transformation (which maps counts onto a log-like scale)
  3. Further normalising the VST values by dividing by the longest transcript length within each gene (which I call VSTPk)
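The three steps above can be sketched roughly as follows. Note this is only an illustration: a plain log2(x + 1) stands in for DESeq's vst, and the function name, gene names, lengths, and thresholds are all invented for the example:

```python
import math

def vstpk_like(counts, tx_lengths, min_count=10, pseudocount=1.0):
    """Rough sketch of the three steps for a genes-x-samples count matrix:
      1. drop genes whose raw count is low in every sample,
      2. apply a variance-stabilising transform (here log2(x + 1)
         as a simple stand-in for DESeq's vst),
      3. divide by the longest transcript length (in kb) per gene
         ("VSTPk").
    """
    out = {}
    for gene, row in counts.items():
        if all(c < min_count for c in row):              # step 1: filter
            continue
        vst = [math.log2(c + pseudocount) for c in row]  # step 2: stand-in VST
        kb = tx_lengths[gene] / 1000.0
        out[gene] = [v / kb for v in vst]                # step 3: VSTPk
    return out

counts = {
    "geneA": [100, 150],   # expressed in both samples; kept
    "geneB": [2, 3],       # low in all samples; filtered out
}
lengths = {"geneA": 2000, "geneB": 1000}  # longest transcript per gene, bp
result = vstpk_like(counts, lengths)
```

In real use, step 2 would be DESeq2's `vst` on the full count matrix rather than a per-value log, since vst estimates the mean-variance trend across all genes.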

These steps are stated in a bit more detail in our Th2 paper that was published at the end of last year:

http://jem.rupress.org/content/early/2016/12/01/jem.20160470#materials-methods

gringer
  • Given that I'm working with bacteria I will probably need to skip step 3. Thanks for the insight! – mgalardini Aug 18 '17 at 11:17
  • 1
    Do you have a reference for the claim that PCA assumes normality? It's not something I've come across before, and indeed, many sources say it does not. e.g. https://stats.stackexchange.com/questions/32105/pca-of-non-gaussian-data – Ian Sudbery Aug 18 '17 at 12:07
  • 1
    I'll preface this with "I am not a statistician", and am basing this on my memories of conversations I had with a biostatistician I worked with. The operations carried out for a PCA assume that things like mean and variance and euclidean distance work normally and predictably; a grossly non-normal distribution can affect this. PCA is fairly robust to non-normal distributions, but not completely immune. – gringer Aug 18 '17 at 21:19