6

I downloaded gene expression data (exp_seq) from the ICGC file browser.

For each sample and gene, the file contains a normalized_read_count.

What is that value? I couldn't find any information on the ICGC website. The values are definitely too low for TPM.

Gregor Sturm

1 Answer

5

By reading this thread on seqanswers and by comparing the data to TCGA, I figured out the following:

  • raw_read_count is the read count that you would use as input for, e.g., DESeq2. It was estimated using RSEM.
  • normalized_read_count is equivalent to the scaled_estimate from TCGA. This is the fraction of transcripts made up by a given gene, as estimated by RSEM. Multiplying this value by 1e6 yields the TPM.
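As a minimal sketch of that conversion, the snippet below multiplies normalized_read_count by 1e6 and sums the result per sample. The column names icgc_sample_id and gene_id are assumptions about the exp_seq layout, the helper names to_tpm and sample_totals are made up for illustration, and the rows are toy values rather than real ICGC data.

```python
from collections import defaultdict


def to_tpm(rows):
    """Convert normalized_read_count (a fraction) to TPM, grouped by sample.

    `rows` is an iterable of dicts keyed by column name. The keys
    'icgc_sample_id' and 'gene_id' are assumed column names.
    """
    tpm = defaultdict(dict)
    for row in rows:
        fraction = float(row["normalized_read_count"])
        tpm[row["icgc_sample_id"]][row["gene_id"]] = fraction * 1e6
    return dict(tpm)


def sample_totals(tpm):
    """Sum TPM per sample; a complete sample should come out close to 1e6."""
    return {sample: sum(genes.values()) for sample, genes in tpm.items()}


# Toy rows standing in for lines of an exp_seq file (not real data).
rows = [
    {"icgc_sample_id": "S1", "gene_id": "A", "normalized_read_count": "0.25"},
    {"icgc_sample_id": "S1", "gene_id": "B", "normalized_read_count": "0.75"},
    {"icgc_sample_id": "S2", "gene_id": "A", "normalized_read_count": "0.5"},
    {"icgc_sample_id": "S2", "gene_id": "B", "normalized_read_count": "0.5"},
]

tpm = to_tpm(rows)
totals = sample_totals(tpm)
print(totals)
```

For a real download, the rows could come from csv.DictReader with delimiter="\t", assuming the file is tab-separated. If the per-sample totals land near 1e6, that supports reading normalized_read_count as a fraction of transcripts; totals well below that would suggest some genes were filtered out of the file.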
Gregor Sturm
  • 1
    Glad you could find it yourself! Many thanks for posting the answer – llrs Jan 10 '18 at 15:36
  • Thank you for your answer. This is essential information for a user of this data. I wonder why this information is not on the ICGC website itself, though. I found a page describing the columns on GitHub: https://github.com/icgc-dcc/dcc-docs/blob/master/docs/dictionary/release-20/sequencing-based-gene-expression-expseq-primary-file-p.md. It does not explicitly say what normalized_read_count is. Also, the page corresponds to one of the older releases (20), not the current one (28). – user345394 Dec 30 '21 at 17:48