
I am preprocessing scRNA-seq data. What is the best practice for combining ComBat batch-effect removal, data imputation (to mitigate dropout), and library-size normalization, and in what order should they be run?

I thought that library-size normalization should be run first, since it is a per-cell normalization, followed by ComBat batch-effect removal. In the original paper - Johnson et al. (2007) - it is stated that:

We assume that the data have been normalized and expression values have been estimated for all genes and samples.

However, I want to apply it to scRNA-seq data. Does this statement still hold? Additionally, I plan to apply imputation (e.g. with MAGIC) at the end. Is there any problem you can spot?

Update

I attach the PCA of an example Mus musculus dataset, in which different colors represent different mice. It seems clear to me that the first two principal components are affected by batch (mouse ID).

[Figure: PCA of the log-transformed data, colored by mouse ID]
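
For reproducibility, here is roughly how such a check can be done with scanpy; the input file name and the mouse_id metadata column are placeholders for illustration, not details from my actual data:

    import scanpy as sc

    # Placeholder input: an AnnData with raw counts and per-cell mouse IDs
    adata = sc.read_h5ad("mouse_counts.h5ad")

    sc.pp.log1p(adata)                  # log-transform, as used for this PCA
    sc.pp.pca(adata, n_comps=2)         # first two principal components
    sc.pl.pca(adata, color="mouse_id")  # color cells by mouse to expose batch structure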

Update 2

I reran the PCA on the raw counts (the first PCA was on log-transformed data) and obtained a different picture of the dataset, in which batch effects do not seem prevalent.

[Figure: PCA of the raw counts data]

  • From what I can tell, MAGIC should be run on raw data, so that would be the first step. – burger Jan 04 '18 at 00:00
  • @burger MAGIC normalizes data before imputation, so it should be run at least after library size normalization. My concern is that using MAGIC before ComBat will amplify batch effects. Reading the paper I could not find any reference to batch effect removal. – gc5 Jan 04 '18 at 16:32
  • The advice I got was that it is best to adjust for batch effects instead of removing them. Did you try adjusting for your batch effects? How big is your batch effect? (Does the PCA, MDS, or a dendrogram show a clear distinction by your batch effect (or several batches)?) – llrs Jan 10 '18 at 09:39
  • @Llopis yes, actually by batch effects removal I meant adjusting for batch effects with ComBat - is that what you meant? – gc5 Jan 10 '18 at 14:34
  • No, ComBat doesn't adjust for the batch effect, it "removes" it (despite the first line of the help page). From the (same) help page: "Users are returned an expression matrix that has been corrected for batch effects"; it modifies the data to "adjust" for it, instead of adding/calculating a factor to be taken into account in later steps. The latter can be done in limma, DESeq2, and other packages, but adjusting is not the same as removing. – llrs Jan 10 '18 at 14:57
  • @Llopis ok thanks, I didn't know this distinction. However, my PCA shows a clear distinction between batches. I'll update the question with the figure. Can you please elaborate on calculating a factor to be taken into account in later steps? Do you mean extracting the principal component correlated with batch and later doing something with it? – gc5 Jan 10 '18 at 15:03
  • Well, what I do is include the known batch effects in the linear models. That could be via a component of the PCA or the categories you know for your batches. Could you expand on what your batches are? What are 3_8, 3_38, ...? (I assume M is male and F is female.) – llrs Jan 10 '18 at 15:06
  • @Llopis unfortunately I have no metadata information regarding the first part of the mouse ID. They only provide sex (M/F), which is as you assumed. – gc5 Jan 10 '18 at 15:20
  • Your PC2 separates the cells of the mice but accounts for only 0.6% of the variation, so I would say there isn't a batch effect. The first dimension is quite high, but I don't know if this is normal in scRNA-seq. I wouldn't adjust for nor remove a batch effect here if this were bulk RNA-seq. But I have never analysed scRNA-seq. – llrs Jan 10 '18 at 15:24
  • In my experience, the absolute first thing you need to do is normalize for library size. I suspect that if you color your cells according to library size you will notice a clear correlation with PC1. – galicae Jan 25 '18 at 15:01
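
A quick way to test galicae's suggestion is to color the same PCA by per-cell total counts; a sketch, again assuming scanpy and the placeholder names from above:

    import numpy as np
    import scanpy as sc

    adata = sc.read_h5ad("mouse_counts.h5ad")  # placeholder input

    # Per-cell library size (total molecule counts)
    adata.obs["library_size"] = np.asarray(adata.X.sum(axis=1)).ravel()

    sc.pp.log1p(adata)
    sc.pp.pca(adata, n_comps=2)
    sc.pl.pca(adata, color="library_size")  # a gradient along PC1 indicates size-driven variation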

1 Answer


MAGIC assumes the input data has been library-size normalized and either log- or square-root-transformed prior to imputation (see also: the MAGIC tutorial). Additionally, any graph-based method (MAGIC, PHATE, t-SNE, UMAP, spectral clustering, Louvain, etc.) will give flawed results if your data contains a batch effect, since the neighbourhood graph will reflect the structure of the batch effect, and worse, imputation will further reinforce it.

Thus I would recommend the following pipeline (a code sketch follows the list):

  • Library-size normalization
  • Square root (or log) transform
  • Batch effect removal
  • Imputation
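
A minimal sketch of this pipeline in Python, assuming scanpy (which provides ComBat as sc.pp.combat) and the magic-impute package (exposed via sc.external.pp.magic); the file name and the "batch" column are placeholders:

    import scanpy as sc

    # Placeholder: raw counts with an obs column "batch" identifying the mice
    adata = sc.read_h5ad("mouse_counts.h5ad")

    # 1. Library-size normalization
    sc.pp.normalize_total(adata)

    # 2. Log (or square-root) transform; np.sqrt(adata.X) would be the sqrt alternative
    sc.pp.log1p(adata)

    # 3. Batch effect removal with ComBat
    sc.pp.combat(adata, key="batch")

    # 4. Imputation with MAGIC (requires magic-impute to be installed)
    sc.external.pp.magic(adata, name_list="all_genes")

Note that this ordering also matches the assumption quoted from Johnson et al. (2007): the data are normalized before ComBat is applied.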

Regarding your update: the reason you don't see the batch effect in the raw counts data is simply that the batch effect is not visible in the most highly expressed genes. Prior to transformation, the principal source of variation in your data is the expression of the most highly expressed genes - this masks the batch effect rather than removing it. I recommend never working with raw molecule counts in scRNA-seq, as the raw counts hide much of the heterogeneity in your dataset, which is precisely what you are looking for when doing single-cell RNA-seq.
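
To convince yourself of this, you can compare which genes dominate PC1 before and after the transform; an illustrative sketch with placeholder names, not part of the original analysis:

    import numpy as np
    import scanpy as sc

    adata = sc.read_h5ad("mouse_counts.h5ad")  # placeholder: raw counts

    # PCA on raw counts: PC1 loadings concentrate on the most highly expressed genes
    sc.pp.pca(adata, n_comps=2)
    raw_top = np.argsort(np.abs(adata.varm["PCs"][:, 0]))[::-1][:10]

    # After normalization + log transform, PC1 reflects biological heterogeneity instead
    sc.pp.normalize_total(adata)
    sc.pp.log1p(adata)
    sc.pp.pca(adata, n_comps=2)
    log_top = np.argsort(np.abs(adata.varm["PCs"][:, 0]))[::-1][:10]

    print(adata.var_names[raw_top])  # typically the highest-expressed genes
    print(adata.var_names[log_top])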

Scott Gigante