What is the difference between TPM and CPM when dealing with RNA-seq data?

Which metric would you use for downstream analyses other than differential expression, for example:

Clustering with the hclust function, followed by a heatmap to look for differences in expression levels, plus correlation and PCA?

Is it wrong to use TPM for such analyses? If so, when does one use TPM versus CPM?

2 Answers

You can find the various equations in this oft-cited blog post from Harold Pimentel. CPM is basically depth-normalized counts, whereas TPM is length-normalized (and then normalized by the length-normalized values of the other genes).
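
As a rough illustration of the two definitions, here is a minimal Python sketch. The gene names, counts, and lengths are made up, and the helper names `cpm` and `tpm` are hypothetical rather than taken from any library.

```python
def cpm(counts):
    """Counts per million: scale each count by the sample's total count."""
    total = sum(counts.values())
    return {g: c / total * 1e6 for g, c in counts.items()}


def tpm(counts, lengths_kb):
    """Transcripts per million: divide by length first, then scale by the
    sum of the length-normalised values of all genes in the sample."""
    rate = {g: counts[g] / lengths_kb[g] for g in counts}
    total_rate = sum(rate.values())
    return {g: r / total_rate * 1e6 for g, r in rate.items()}


# Toy data (made up): one sample, three genes.
counts = {"geneA": 100, "geneB": 200, "geneC": 700}
lengths_kb = {"geneA": 1.0, "geneB": 2.0, "geneC": 7.0}

sample_cpm = cpm(counts)
sample_tpm = tpm(counts, lengths_kb)

for gene in counts:
    print(gene, round(sample_cpm[gene]), round(sample_tpm[gene]))
```

In this toy sample the longer genes collect proportionally more reads, so CPM differs sevenfold between geneA and geneC while TPM is the same for all three.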

If you have to choose between the two, TPM is the usual choice for most purposes, since the length normalization is generally handy. Realistically, you probably want log(TPM), since otherwise noise in your most highly expressed genes dominates over small expression signals.
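
To make the point about the log concrete, the sketch below compares how much each gene contributes to the squared Euclidean distance between two samples, before and after a log transform. All values are made up, and the pseudocount of 1 (to avoid log(0)) is an assumption for illustration, not something the answer prescribes.

```python
import math


def log2_tpm(tpm_values, pseudocount=1.0):
    """log2(TPM + pseudocount); the pseudocount avoids log(0)."""
    return [math.log2(v + pseudocount) for v in tpm_values]


def squared_diffs(a, b):
    """Per-gene contribution to the squared Euclidean distance."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


# Toy TPM values (made up). Gene 0 is highly expressed and fluctuates by 10%;
# gene 1 is lowly expressed and changes 8-fold.
sample1 = [50000.0, 2.0]
sample2 = [55000.0, 16.0]

raw = squared_diffs(sample1, sample2)
logged = squared_diffs(log2_tpm(sample1), log2_tpm(sample2))

# Fraction of the squared distance contributed by the highly expressed gene.
raw_share = raw[0] / sum(raw)
log_share = logged[0] / sum(logged)
print(f"raw: {raw_share:.4f}, log2: {log_share:.4f}")
```

On the raw scale nearly all of the distance comes from the small relative fluctuation in the highly expressed gene; on the log scale the 8-fold change in the lowly expressed gene dominates instead.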

Devon Ryan
  • If one trims adapters from paired-end RNA-seq data, the reads end up with different lengths. Since, as you say above, TPM is length-normalized, does that mean this difference in read length is taken into account? – novicebioinforesearcher Aug 14 '17 at 21:40
  • @novicebioinforesearcher No, it refers to transcript/gene length, which correlates (somewhat) with counts and will therefore tend to drive clustering unless handled in a reasonable way. – Devon Ryan Aug 15 '17 at 07:04
  • Rob Patro also wrote a pretty good article on the topic: http://robpatro.com/blog/?p=235 – story Aug 17 '17 at 14:35
  • [link] (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8220791/) – envs_h_gang_5 Jun 21 '22 at 04:27

Neither CPM nor TPM is well suited here, because neither performs robust cross-sample normalisation (see the blog post Devon linked to).

DESeq2 provides two robust log-space normalisation methods for downstream analysis: the regularised log (rlog) and the variance stabilising transformation (vst). The DESeq2 vignette explains how to use these for things like hclust.


On a more general note, CPM does not account for transcript length differences, while TPM does. If the choice is between TPM and CPM I would therefore use TPM. However, if you are only comparing the same transcripts across experiments, the transcript length is actually invariant so it doesn’t matter (but CPM is still not a good cross-experiment normalisation).

Konrad Rudolph
  • I am confused now. What is the use of TPM, then? Why does one produce it, and when or where do you use it? In other words, for which tools or analyses in RNA-seq would you use TPM if everything revolves around taking counts and pushing them through DESeq2? – novicebioinforesearcher Aug 15 '17 at 12:25
  • Tools produce TPMs because they don’t have the information (= the other samples) necessary to perform cross-sample normalisation. Lacking that, TPM is the best they can do. TPM is also useful for within-sample comparisons: it can give you an accurate estimate of how much genes are expressed in a given sample relative to each other. – Konrad Rudolph Aug 15 '17 at 12:29
  • Please correct me if I am wrong here. Take an experimental design with different cell types from a normal mouse, say 4 cell types (3 replicates each), sequenced using the same library prep but maybe at different times. If the aim is to check for a set of cell type specific transcripts you would use TPM, whereas if you want to add a significance parameter (need a p-value) you would use a raw-count-based analysis? I guess the confusion for me is how we use the word "expression". People use TPM and call it expression, and also use raw counts and call it expression. – novicebioinforesearcher Aug 15 '17 at 12:40
  • Both are estimates for expression, given the data. Your use-case sounds reasonable, although I would generally prefer to determine “cell type specific transcripts” by comparing different cell types, rather than based solely on a single sample. Which would imply performing differential expression analysis. – Konrad Rudolph Aug 15 '17 at 13:24
  • Which would imply performing differential expression analysis on? – novicebioinforesearcher Aug 15 '17 at 15:08
  • @novicebioinforesearcher On whatever data sets you wish to compare. It rarely (if ever!) makes sense to describe a gene as being cell type specific without saying “as opposed to these other cell types”. For example, a cell type specific gene may nevertheless be lowly expressed: as long as it’s entirely absent in other cells, it’s cell type specific. This is in fact often the case. You therefore cannot characterise many cell type specific genes without comparing different cell types. – Konrad Rudolph Aug 15 '17 at 17:12
  • Is VST/RLOG normalised for transcript length? If not is it possible to get length normalised VST/RLOG? – mindlessgreen Oct 05 '18 at 10:15
  • @rmf No, they do not normalise for transcript length; both functions merely change the distribution shape of the counts into something closer to linear. For those purposes where you’d use rlog/vst, accounting for transcript length is normally not important. You could however apply a further transformation (let’s call it rlog-TPM) if you have an application where you need cross-sample as well as within-sample normalised values. – Konrad Rudolph Oct 05 '18 at 12:25
  • Would you say it's OK to do something like (vst/length)*(10^6) and use that for heatmaps where I actually want to compare the expression of one gene to another gene? – mindlessgreen Oct 05 '18 at 16:58
  • @rmf Yes, but in a heatmap you usually scale by (gene) row anyway (either explicitly or the plotting function performs the division internally), so the normalisation per transcript length will be strictly a no-op. – Konrad Rudolph Oct 05 '18 at 18:04
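
The no-op claim in the last comment can be checked with a small Python sketch. It assumes the heatmap scales each gene row to z-scores (mean 0, standard deviation 1); the expression values and the transcript length are made up for illustration.

```python
import statistics


def zscore(row):
    """Centre a row on its mean and divide by its sample standard deviation."""
    mean = statistics.mean(row)
    sd = statistics.stdev(row)
    return [(v - mean) / sd for v in row]


# Toy transformed expression values (made up) for one gene across four samples.
gene_row = [8.1, 8.4, 10.2, 10.9]
length_bp = 2500  # hypothetical transcript length

# The per-length rescaling multiplies the whole row by a single constant.
rescaled_row = [v / length_bp * 1e6 for v in gene_row]

z_plain = zscore(gene_row)
z_rescaled = zscore(rescaled_row)

# Largest difference between the two scaled rows; expected to be
# floating-point noise only.
max_diff = max(abs(a - b) for a, b in zip(z_plain, z_rescaled))
print(max_diff < 1e-9)
```

Because the length factor is one positive constant per gene, it cancels when the row is divided by its own standard deviation, so the heatmap colours do not change. Without row scaling, the factor would not cancel.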