How is prior.count used by edgeR's cpm?

Question

edgeR's cpm function has an argument called prior.count. Based on my understanding of the documentation, it is supposed to add a fixed number to each sample, proportional to that sample's library size, so that the average of the numbers added across all samples equals prior.count.

However, looking at actual data this does not seem to be the case. Given an imaginary data frame of

df = data.frame(a = c(1,2,3,0),c = c(1,2,3,0)*3, b = c(1,2,3,0)*2)

We can try to calculate log cpms by doing

logCPM = cpm(df,log = TRUE,prior.count = .5)

We can also calculate regular cpms by doing

CPM = cpm(df)

To see what number is added to the CPMs before they are logged, we could do

difference = 2^logCPM - CPM

To make it look pretty, let's use tibble

tibble::as.tibble(difference)

# A tibble: 4 x 3
                     a                    c                    b
                 <dbl>                <dbl>                <dbl>
1  25641                25641                25641              
2  12821                12821                12821              
3 -    0.0000000000582 -    0.0000000000582 -    0.0000000000582
4  38462                38462                38462

Here we see that the number that is added has no relation to the library size of the sample. What I want to learn is how the number added to each CPM is calculated from prior.count.

I have been digging through the code, but that part goes into C++ territory, which hinders my understanding of what is going on.

I get different results in the difference table. Which version are you using? (I am on R 3.5.1 and edgeR 3.22.2) – llrs Sep 06 '18 at 08:30

score 7 · Accepted Answer · answered Sep 06 '18 at 08:34

The prior count is scaled by the ratio of each library size to the average library size. That scaled value is added to every count in the sample, and twice the scaled value is added to the sample's library size (I'm sure there's a good reason for the factor of 2, but I don't know what it is). Using the example df data frame in your post, let's walk step-by-step through what cpm() is doing:

df = data.frame(a = c(1,2,3,0),c = c(1,2,3,0)*3, b = c(1,2,3,0)*2)
prior.count = 0.5
# First, we need to calculate a library size
lib.size = colSums(df)
# Calculate the average library size and the adjusted priors
ave.lib = mean(lib.size)
adjusted.prior = prior.count * lib.size / ave.lib
# Update the library sizes
adjusted.lib.size = lib.size + 2*adjusted.prior
# Now we can compute the log2 CPM
customCPM = t((log(t(df) + adjusted.prior) - log(adjusted.lib.size) + log(1000000))/log(2))

The matrix transposition looks really ugly in the last step, but it's needed so R adds things like adjusted.prior to each row of df.
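Written as a formula, the R code above amounts to the following. This is only a restatement of that code, not something taken from the edgeR documentation, and the symbols are my own labels: y_{gi} is the count for gene g in sample i, N_i is the library size of sample i, \bar{N} is the mean library size, and p_i is the adjusted prior for sample i.

```latex
p_i = \text{prior.count} \cdot \frac{N_i}{\bar{N}}, \qquad
\log_2 \mathrm{CPM}_{gi} = \log_2\!\left( \frac{y_{gi} + p_i}{N_i + 2\,p_i} \cdot 10^{6} \right)
```

For sample a in the example (N = 6, mean library size 12, prior.count = 0.5), p = 0.25, so the first gene gives (1 + 0.25) / (6 + 0.5) * 10^6, which is about 192308, or roughly 17.55 on the log2 scale.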

Note that the prior count is only used if you compute log(CPM); it's unused otherwise.
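As a rough check, the same steps can be wrapped in a small function that uses sweep() instead of transposing. This is a minimal sketch: log_cpm_sketch is a hypothetical helper name, not an edgeR function, and it assumes the scaling described above. The comparison against edgeR only runs if the package is installed, and the result may depend on the edgeR version.

```r
# Hypothetical helper (not part of edgeR): log2-CPM with a prior count
# scaled by library size, following the steps described in this answer.
log_cpm_sketch <- function(counts, prior.count = 0.5) {
  counts <- as.matrix(counts)
  lib.size <- colSums(counts)
  adjusted.prior <- prior.count * lib.size / mean(lib.size)
  adjusted.lib.size <- lib.size + 2 * adjusted.prior
  # sweep() with MARGIN = 2 applies one value per column (sample),
  # so no transposition is needed
  shifted <- sweep(counts, 2, adjusted.prior, "+")
  log2(sweep(shifted, 2, adjusted.lib.size, "/") * 1e6)
}

df <- data.frame(a = c(1, 2, 3, 0), c = c(1, 2, 3, 0) * 3, b = c(1, 2, 3, 0) * 2)
sketch <- log_cpm_sketch(df, prior.count = 0.5)

# Plain CPM without any prior, for comparison
plainCPM <- sweep(as.matrix(df), 2, colSums(df), "/") * 1e6
print(round(2^sketch - plainCPM))

# Optional comparison with edgeR itself, only if it is installed
if (requireNamespace("edgeR", quietly = TRUE)) {
  print(all.equal(sketch, edgeR::cpm(df, log = TRUE, prior.count = 0.5),
                  check.attributes = FALSE))
}
```

The rounded differences come out as 25641, 12821, 0 and 38462 in every column, matching the table in the question. The columns are identical because the three samples in the example are exact multiples of one another, so the scaled prior shifts each of them by the same relative amount.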

Devon Ryan


How do you know that edgeR is scaling that way to obtain the adjusted.prior? By the way, sweep could be used to avoid having to transpose the matrix. – llrs Sep 06 '18 at 08:57
It's stated in the source code. I know I can use sweep(), but t() takes fewer characters. – Devon Ryan Sep 06 '18 at 09:46
Thanks a lot. By the way, was I wrong about the code being within the C++ part? If the R code above is part of the package, I have missed it. – OganM Sep 06 '18 at 22:08
You didn't miss it, I translated the C++ code to more accessible R. The C++ code is pretty tough to follow unless you're pretty comfortable with it. – Devon Ryan Sep 06 '18 at 22:22
thanks this is an awesome answer and helps me understand prior.count – Ahdee Jun 29 '19 at 02:34
