Large dataset normalization for PCA

Question

I need to normalize a large (400Mb) dataset for doing PCA analysis. I want to use scran for doing that:

biocLite("SingleCellExperiment")
biocLite("scran")
library(SingleCellExperiment)
library(scran)
list_of_sce <- list()
# Looping though the UMI_count 'split_factor' columns at a time
split_factor = 500
for(i in seq(1,ncols, split_factor)) {
  num_loop = floor(i / split_factor) + 1
  idx = ncols
  if (i + split_factor < ncols) {
  idx = i + split_factor
  } 
  sce <- SingleCellExperiment(list(counts=UMI_count[, i : idx]))
  # Normalization of the dataset containing 
  # heterogenous cell data (different cell types)
  clusters <- quickCluster(sce) ## <- ** error appears here **
  sce <- computeSumFactors(sce, cluster=clusters)
  list_of_sce[[num_loop]] <- sce
}
cbind(list_of_sce)

However, when I try to run this it produces an error at the quickCluster step:

Error: cannot allocate vector of size 23.7 Gb

It feels like cbind at the end is not the right way to go, and I need to somehow combine sce and re-normalize them again somehow. The question is how to do that in a correct way: normalizing large dataset for PCA analysis when it can not be fully loaded into memory?

Which error do you get when you normalize the whole dataset? How big is your dataset? — llrs, Dec 19 '17 at 08:19

score 3 · Accepted Answer · edited May 17 '18 at 16:50

I normalized a high throughput dataset for a school project using DESeq library using the script bellow. The code is based on a lesson I had. My goal was determine the over expressed genes, but the normalization step should be the same.

#####################################################################
### DATA PRE-PROCESSING: Normalising gene expression distributions
#####################################################################
# Genes that are not expressed at a biologically meaningful level 
# in any condition should be discarded to reduce the subset of genes 
# to those that are of interest, and to reduce the number of tests carried out 
# downstream when looking at differential expression.
# Using a nominal CPM value of 1 (which is equivalent to a log-CPM value of 0 
# a gene is deemed to be expressed in a given sample if its transformed count 
# is above this threshold, and unexpressed otherwise. 
# Genes must be expressed in at least one group (or roughly any three samples 
# across the entire experiment) to be kept for downstream analysis.

library("DESeq")

keep.exprs <- rowSums(cpm>1)>=3 # is filtering is ok
DGEfiltered <- DGEobject[keep.exprs,, keep.lib.sizes = FALSE]

## Normalisation generally required to ensure that expression distributions
## of each sample are similar across the entire experiment.
## Any plot showing the per sample expression distributions, such as a density or boxplot,
## is useful in determining whether any samples are dissimilar to others. 
## Normalisation by the method of trimmed mean of M-values (TMM)
## is performed using the calcNormFactors function in edgeR.
# cpmF = filtered subset # log.cpmF = log cpm filtered subset
# Copy unnormalized, filtered data to show in boxplot
DGEf.notnorm <- DGEfiltered # normaliseerde 
DGEf.norm <- calcNormFactors(DGEfiltered, method = "TMM")
DGEf.norm$samples$norm.factors


# --> scaling factors relatively close to 1 --> mild TMM normalisation

## Boxplots of log-CPM values showing expression distributions 
## for unnormalised data (A) and normalised data (B) for each sample
par(mfrow = c(1,1))
par(mar = c(7,4,2,1))
# Boxplot unnormalised data
DGEf.notnorm$samples$norm.factors
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1
log.cpmF.notnorm <- cpm(DGEf.notnorm, log = TRUE)
boxplot(log.cpmF.notnorm, las = 2, col = sample.col, cex = 0.9, 
        main = "A. Unnormalised data", ylab = "Log-cpm")
# Boxplot normalised data
DGEf.norm$samples$norm.factors

log.cpmF.norm <- cpm(DGEf.norm, log = TRUE)
boxplot(log.cpmF.norm, las = 2, col = sample.col, cex = 0.9, 
        main = "B. Normalised data", ylab = "Log-cpm")
# --> only when scaling factors are not close to 1 you will see a big difference

Hello nemo, thanks for your answer, looks good. Could you add also some info where the script comes from? Maybe a reference or at least a bit of reasoning? I also believe that you should include some libraries. — Kamil S Jaron, May 17 '18 at 14:31
I wrote it for the most part my self for a school project. The rest of the code is from the lesson I had. It was for an high throughput analysis on dataset for a disease. To determine the over expressed genes. — Nemo, May 17 '18 at 15:29
Cool, if the materials of the class are available it would be good to link it, but that sounds already quite good to me. I tried to [edit] your post (& add there the info), check if it still makes sense and do not hesitate to make a further edit to fix it or to add more details/links. — Kamil S Jaron, May 17 '18 at 16:52

Large dataset normalization for PCA

1 Answers1

Linked