edgeR's cpm function has an argument called prior.count. Based on my understanding of the documentation, it adds a fixed number to each sample, proportional to that sample's library size, so that the average of the numbers added across all samples equals prior.count.
However, when I try this on data, it does not seem to be the case. Take a made-up data frame:
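To make that reading concrete, here is a minimal base-R sketch of what I would expect to be added to each sample. It assumes the scaling is prior.count * lib.size / mean(lib.size); the library sizes and variable names are made up for illustration, and this reflects my interpretation of the documentation, not edgeR's actual code.

```r
# Hypothetical illustration of my reading of the docs (not edgeR internals).
lib.size <- c(a = 6, c = 18, b = 12)   # made-up per-sample library sizes
prior.count <- 0.5

# Assumed scaling: proportional to library size, averaging to prior.count
expected.added <- prior.count * lib.size / mean(lib.size)

expected.added        # a = 0.25, c = 0.75, b = 0.5
mean(expected.added)  # 0.5, i.e. equal to prior.count
```

Under this reading, the sample with the largest library gets the largest added value, and the values average out to prior.count.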
df = data.frame(a = c(1,2,3,0),c = c(1,2,3,0)*3, b = c(1,2,3,0)*2)
We can calculate log CPMs with
library(edgeR)
logCPM = cpm(df, log = TRUE, prior.count = 0.5)
We can also calculate regular CPMs with
CPM = cpm(df)
To see what number is added to the CPMs before the log is taken, we can compute
difference = 2^logCPM - CPM
To make it look pretty, let's use tibble:
tibble::as_tibble(difference)
# A tibble: 4 x 3
                 a                c                b
             <dbl>            <dbl>            <dbl>
1            25641            25641            25641
2            12821            12821            12821
3 -0.0000000000582 -0.0000000000582 -0.0000000000582
4            38462            38462            38462
Here we see that the number added is the same in every sample, so it has no apparent relation to the sample's library size, and it also varies from row to row. What I want to learn is how the number added to each CPM is calculated from prior.count.
I have been digging through the code, but that part goes into C++ territory, which makes it hard for me to follow what is going on.
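One guess that can be checked in plain R, without reading the C++, is that the scaled prior is added to the raw counts rather than to the CPMs, and that each library size is inflated by twice the scaled prior before dividing. Both steps are assumptions on my part, and all names below are illustrative rather than edgeR internals.

```r
# Hypothesis check in base R; names are illustrative, not edgeR internals.
df <- data.frame(a = c(1, 2, 3, 0), c = c(1, 2, 3, 0) * 3, b = c(1, 2, 3, 0) * 2)
prior.count <- 0.5
lib.size <- colSums(df)

# Assumption 1: the prior is scaled by relative library size and added to counts
prior.scaled <- prior.count * lib.size / mean(lib.size)
# Assumption 2: library sizes are inflated by twice the scaled prior
lib.adj <- lib.size + 2 * prior.scaled

# t(df) puts samples in rows, so the per-sample vectors recycle correctly
manual.logCPM <- log2(t((t(df) + prior.scaled) / lib.adj) * 1e6)
manual.CPM <- t(t(df) / lib.size) * 1e6

manual.difference <- 2^manual.logCPM - manual.CPM
round(manual.difference)
```

Under these assumptions the differences come out at roughly 25641, 12821, 0 and 38462 in every column, which is the same pattern as in the table above. If that is what cpm does, the columns would agree here only because the three samples are exact multiples of one another.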
difference table. Which version are you using? (I am on R 3.5.1 and edgeR 3.22.2) – llrs Sep 06 '18 at 08:30