3

I need to normalize a large (400Mb) dataset for doing PCA analysis. I want to use scran for doing that:

biocLite("SingleCellExperiment")
biocLite("scran")
library(SingleCellExperiment)
library(scran)
list_of_sce <- list()
# Looping though the UMI_count 'split_factor' columns at a time
split_factor = 500
for(i in seq(1,ncols, split_factor)) {
  num_loop = floor(i / split_factor) + 1
  idx = ncols
  if (i + split_factor < ncols) {
  idx = i + split_factor
  } 
  sce <- SingleCellExperiment(list(counts=UMI_count[, i : idx]))
  # Normalization of the dataset containing 
  # heterogenous cell data (different cell types)
  clusters <- quickCluster(sce) ## <- ** error appears here **
  sce <- computeSumFactors(sce, cluster=clusters)
  list_of_sce[[num_loop]] <- sce
}
cbind(list_of_sce)

However, when I try to run this it produces an error at the quickCluster step:

Error: cannot allocate vector of size 23.7 Gb

It feels like cbind at the end is not the right way to go, and I need to somehow combine sce and re-normalize them again somehow. The question is how to do that in a correct way: normalizing large dataset for PCA analysis when it can not be fully loaded into memory?

gringer
  • 14,012
  • 5
  • 23
  • 79
Nikita Vlasenko
  • 2,558
  • 3
  • 26
  • 38

1 Answers1

3

I normalized a high throughput dataset for a school project using DESeq library using the script bellow. The code is based on a lesson I had. My goal was determine the over expressed genes, but the normalization step should be the same.

#####################################################################
### DATA PRE-PROCESSING: Normalising gene expression distributions
#####################################################################
# Genes that are not expressed at a biologically meaningful level 
# in any condition should be discarded to reduce the subset of genes 
# to those that are of interest, and to reduce the number of tests carried out 
# downstream when looking at differential expression.
# Using a nominal CPM value of 1 (which is equivalent to a log-CPM value of 0 
# a gene is deemed to be expressed in a given sample if its transformed count 
# is above this threshold, and unexpressed otherwise. 
# Genes must be expressed in at least one group (or roughly any three samples 
# across the entire experiment) to be kept for downstream analysis.

library("DESeq")

keep.exprs <- rowSums(cpm>1)>=3 # is filtering is ok
DGEfiltered <- DGEobject[keep.exprs,, keep.lib.sizes = FALSE]

## Normalisation generally required to ensure that expression distributions
## of each sample are similar across the entire experiment.
## Any plot showing the per sample expression distributions, such as a density or boxplot,
## is useful in determining whether any samples are dissimilar to others. 
## Normalisation by the method of trimmed mean of M-values (TMM)
## is performed using the calcNormFactors function in edgeR.
# cpmF = filtered subset # log.cpmF = log cpm filtered subset
# Copy unnormalized, filtered data to show in boxplot
DGEf.notnorm <- DGEfiltered # normaliseerde 
DGEf.norm <- calcNormFactors(DGEfiltered, method = "TMM")
DGEf.norm$samples$norm.factors


# --> scaling factors relatively close to 1 --> mild TMM normalisation

## Boxplots of log-CPM values showing expression distributions 
## for unnormalised data (A) and normalised data (B) for each sample
par(mfrow = c(1,1))
par(mar = c(7,4,2,1))
# Boxplot unnormalised data
DGEf.notnorm$samples$norm.factors
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
log.cpmF.notnorm <- cpm(DGEf.notnorm, log = TRUE)
boxplot(log.cpmF.notnorm, las = 2, col = sample.col, cex = 0.9, 
        main = "A. Unnormalised data", ylab = "Log-cpm")
# Boxplot normalised data
DGEf.norm$samples$norm.factors

log.cpmF.norm <- cpm(DGEf.norm, log = TRUE)
boxplot(log.cpmF.norm, las = 2, col = sample.col, cex = 0.9, 
        main = "B. Normalised data", ylab = "Log-cpm")
# --> only when scaling factors are not close to 1 you will see a big difference
Kamil S Jaron
  • 5,542
  • 2
  • 25
  • 59
Nemo
  • 158
  • 7
  • Hello nemo, thanks for your answer, looks good. Could you add also some info where the script comes from? Maybe a reference or at least a bit of reasoning? I also believe that you should include some libraries. – Kamil S Jaron May 17 '18 at 14:31
  • I wrote it for the most part my self for a school project. The rest of the code is from the lesson I had. It was for an high throughput analysis on dataset for a disease. To determine the over expressed genes. – Nemo May 17 '18 at 15:29
  • Cool, if the materials of the class are available it would be good to link it, but that sounds already quite good to me. I tried to [edit] your post (& add there the info), check if it still makes sense and do not hesitate to make a further edit to fix it or to add more details/links. – Kamil S Jaron May 17 '18 at 16:52