Scaling by linear regression against the number of reads

Question

I am trying to build the preprocessing pipeline presented in The Tabula Muris Consortium et al. (pp).

It is a pipeline to preprocess single-cell sequencing data. There is one step that is not clear:

Counts were log-normalized (log(1 + counts per N)), then scaled by linear regression against the number of reads (or UMIs), the percent of reads mapping to Rn45s, and the percent of reads to ribosomal genes.

I understand the first part (I assume that log in this context is log2), but I need help understanding how to scale by linear regression against the number of reads, the percent of reads mapping to Rn45s, and the percent of reads mapping to ribosomal genes.

Have you contacted the paper authors? They would be able to give you a better answer, and your comments/questions would help them improve the paper when it is properly published. — gringer, Dec 27 '17 at 22:34
@gringer You are right, I posted here because I thought there was some standard procedure. — gc5, Dec 27 '17 at 22:52
The linear regression is straightforward enough, but regressing against only three values (if I'm interpreting that correctly) is a recipe for disaster (it sounds vastly worse than even ERCC spike-ins...and those aren't exactly ideal). — Devon Ryan, Dec 28 '17 at 09:57
@DevonRyan Ignoring for now the small number of regression values, how do you scale using the linear regression? I am unable to see how the linear regression is used in this case. Can you elaborate (maybe with an answer)? — gc5, Dec 28 '17 at 15:00
Did you figure out how they preprocessed the Tabula Muris data? — yuqi_yuqi, Jun 26 '18 at 21:18
@yuqi_yuqi see https://bioinformatics.stackexchange.com/a/3225/1771 — gc5, Jun 28 '18 at 15:20

score 3 · Accepted Answer · answered Apr 20 '18 at 07:52

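I don't know whether this question has been solved already, but what they are trying to do is equalize the sequencing depth across cells, which is why they scale by the total number of reads per cell. To "regress out" a covariate, you fit a regression (linear or negative binomial) of each gene's expression against the number of reads per cell and keep the residuals in place of the original values. What remains is the variation in expression that read depth does not explain, so the cells behave as if they had been sequenced to the same depth.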

In my opinion they use the ribosomal genes in the same way. These are considered a kind of housekeeping gene, so the fraction of reads mapping to them can also be used to equalize the sequencing depth.
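To make the "regress out" idea concrete, here is a minimal NumPy sketch. It is not the consortium's code: the function name `regress_out`, the simulated data, and the final rescaling of residuals to unit variance are all assumptions made for illustration. Each gene is fitted by ordinary least squares against the three covariates from the quoted sentence, and the residuals replace the original values.

```python
import numpy as np

def regress_out(expr, covariates):
    """Return per-gene residuals of expr (cells x genes) after a linear fit
    against covariates (cells x k), rescaled to unit variance per gene.
    Hypothetical helper for illustration only."""
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    if covariates.ndim == 1:
        covariates = covariates[:, None]
    # Design matrix: an intercept column plus one column per covariate.
    design = np.column_stack([np.ones(expr.shape[0]), covariates])
    # Least-squares fit for all genes at once; coef has shape (k + 1, genes).
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    # Residuals: the part of expression the covariates do not explain.
    residuals = expr - design @ coef
    # Assumption: rescale each gene's residuals to unit variance.
    sd = residuals.std(axis=0, ddof=1)
    sd[sd == 0] = 1.0  # leave constant genes at zero instead of dividing by 0
    return residuals / sd

# Simulated data (made up): 200 cells, 5 genes, 3 covariates per cell.
rng = np.random.default_rng(0)
n_cells, n_genes = 200, 5
n_reads = rng.uniform(1e4, 1e5, n_cells)   # number of reads (or UMIs)
pct_rn45s = rng.uniform(0, 10, n_cells)    # percent of reads mapping to Rn45s
pct_ribo = rng.uniform(0, 10, n_cells)     # percent of reads to ribosomal genes
covariates = np.column_stack([n_reads, pct_rn45s, pct_ribo])

# Expression that partly depends on read depth, plus noise.
expr = 2e-5 * n_reads[:, None] + rng.normal(0.0, 1.0, (n_cells, n_genes))

scaled = regress_out(expr, covariates)
```

After this step, each gene's values are uncorrelated with the three covariates, which is the sense in which depth has been "scaled away".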

And yes, the log in this context is log2, used to get fold-change values instead of counts.
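For the log-normalization step that comes before the regression, a minimal sketch of log(1 + counts per N) could look like the following. The function name `log_normalize` and the choice N = 10,000 are assumptions, since the quoted sentence does not state N; the base-2 logarithm follows the statement above rather than anything confirmed by the paper.

```python
import numpy as np

def log_normalize(counts, scale=10_000):
    """log2(1 + counts per `scale`) for a cells x genes count matrix.
    Hypothetical helper; `scale` (the N in "counts per N") is an assumption."""
    counts = np.asarray(counts, dtype=float)
    # Total counts per cell, kept as a column so it broadcasts across genes.
    depth = counts.sum(axis=1, keepdims=True)
    depth[depth == 0] = 1.0  # avoid division by zero for empty cells
    return np.log2(1.0 + counts / depth * scale)

# Two cells with different depths but the same proportion for the first gene.
counts = np.array([[10, 0, 90],
                   [1, 1, 8]])
normalized = log_normalize(counts)
```

The first gene makes up 10% of the counts in both cells, so it gets the same normalized value in each even though the raw counts differ tenfold.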
