I got a dataset from a C. elegans scRNA-seq paper:
GSM2599701_Gene.count.matrix.celegans.cell.Rdata in GSE98561_RAW.tar
The dataset is 40,000 x 68,000, where rows represent genes and columns represent cells. I took it and tried to process it myself to build an scRNA-seq pipeline. Here is what I did:
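For completeness, the loading step was just a call to load(); it restores the saved object under whatever name it was stored with, so the first thing is to check what came in (a sketch):

load("GSM2599701_Gene.count.matrix.celegans.cell.Rdata")
ls()  # shows the name of the restored genes x cells count matrix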
I filtered out the genes that have zero counts in all of the cells, and the dataset was reduced to 29,000 x 68,000. I then removed all of the cells with a total count < 100 across all genes; the dataset became 29,000 x 66,000. Then, because the dataset was too big to run normalization even on a cluster with 120 GB of RAM (there are multiple distinct cell types, so clustering needs to be done first), I selected just the even columns and ran normalization on the resulting 29,000 x 33,000 dataset (UMI_count):

library(scran)
library(scater)

sce <- newSCESet(countData = UMI_count)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters = clusters, positive = TRUE)
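Going back a step, the filtering described above was roughly the following, where counts.all is just my placeholder name for the full loaded matrix:

counts.all <- counts.all[rowSums(counts.all) > 0, ]          # drop genes with zero counts in every cell
counts.all <- counts.all[, colSums(counts.all) >= 100]       # drop cells with a total count < 100
UMI_count <- counts.all[, seq(2, ncol(counts.all), by = 2)]  # keep just the even columns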
After running computeSumFactors() I decided to check whether the data was fine, so I ran:
> summary(sizeFactors(sce))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.0717 0.0000 33.3900
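So at least three quarters of the cells get a size factor of exactly zero. Counting them directly:

sum(sizeFactors(sce) <= 0)    # number of cells with a non-positive size factor
mean(sizeFactors(sce) <= 0)   # the same, as a fraction of all cells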
I also ran PCA on the normalized dataset, and it looks like this:

[PCA plot of the normalized dataset]
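(The plot came from scater's plotPCA(), more or less:)

plotPCA(sce)  # PCA on the expression values stored in the SCESet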
It seems to me that the normalized dataset is terrible and I need to do some further processing before I can run any further analysis. What else could I do to improve it? How should I filter it? There are no spike-ins, and there are maybe 200 mitochondrial genes. The approach described here does not work, probably because the majority of the cells have a low number of genes expressed.
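If I understood the linked approach correctly, it is adaptive, outlier-based filtering along these lines; mito.genes is a placeholder for a vector of mitochondrial gene names I would have to supply, and the 3-MAD cutoffs are the usual defaults rather than anything tuned:

library(scater)

mito <- rownames(UMI_count) %in% mito.genes
lib.sizes <- colSums(UMI_count)
n.genes <- colSums(UMI_count > 0)
pct.mito <- 100 * colSums(UMI_count[mito, , drop = FALSE]) / lib.sizes

# flag cells more than 3 MADs below the median library size or gene count,
# or with an unusually high mitochondrial fraction
drop <- isOutlier(lib.sizes, nmads = 3, type = "lower", log = TRUE) |
        isOutlier(n.genes, nmads = 3, type = "lower", log = TRUE) |
        isOutlier(pct.mito, nmads = 3, type = "higher")
UMI_count.qc <- UMI_count[, !drop]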
I tried removing low-abundance genes after normalization, but it seems that most of them are going to be removed:
> ave.counts <- rowMeans(counts(sce))
> keep <- ave.counts >= 1
> sum(keep)
[1] 109
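An alternative I could try is to keep genes detected in a minimum number of cells instead of filtering on the average count (the 10-cell cutoff here is arbitrary):

num.cells <- rowSums(counts(sce) > 0)  # in how many cells each gene is detected
keep <- num.cells >= 10
sum(keep)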
Should I filter out all of the cells (columns) with a total count below 500 instead of 100? Is that a good idea? A quick tally of how many cells each threshold would keep is sketched below; beyond that, I can't think of anything else.
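Something like this (on the full matrix, counts.all from the sketch above) should at least show how many cells sit between the two cutoffs:

table(cut(colSums(counts.all), breaks = c(0, 100, 500, Inf)))  # cells per total-count bin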