3

Suppose my genome is 3 million bases and that my reads are 100 nucleotide long. I need to know how many reads I need to cover the entire genome.

I start from using the equation $C = \frac{N \cdot L}{G}$ where C is the coverage, N the number of reads, L the length of a read and G the length of the haploid genome. I also know that the average coverage follows a Poisson distribution, so I need this for C.

On my slides, in fact, I have the following:

$$ C = \frac{N \cdot L}{G} \approx ln \left( \frac{G}{L \cdot \epsilon}\right), $$

if we assume that this is the average coverage needed to cover the entire genome with probability $1-\epsilon$.

The problem is that I do not understand how to arrive at the expression on the right. I thought that if we want to be sure to have a coverage of at least 1 on average, we would have had to compute $1 - P(X=0) = 1 - e^{-\frac{N \cdot L}{G}} = 0.99$, if we take $\epsilon=0.01$. But I do not get the same result as with the equation above.

wrong_path
  • 391
  • 1
  • 7

1 Answers1

3

Let's see what your slides are claiming $\epsilon$ to be, using:

$$ C = \frac{NL}{G} \approx \ln \left( \frac{G}{L\epsilon} \right) $$

We can rearrange to get

$$ \epsilon \approx \frac{G}{L} \times e^{-C} $$

Now, what does it mean for an entire genome to be covered? Having a clear understanding of this definition is crucial, and your misunderstanding seems to partially stem from this.

I claim it's natural to consider a genome as consisting of $G/L$ bins. In this setting, we have $N$ balls that each randomly goes into a bin. Then, the entire genome is covered when all bins contain at least $1$ ball/read.

From the Poisson distribution, we know that P(a bin has 0 balls) = $e^{-C}$.

Then, we define $\epsilon$ to be the probability that at least one of the $G/L$ bins has 0 balls. Given the way we've modeled the problem, this is exactly equal to

$$ \epsilon = 1 - (1 - e^{-C})^{G/L}. $$

From above, your slides are claiming $ \epsilon \approx \frac{G}{L} \times e^{-C}$. This might be strange, as it appears to be equal to the expected number of bins with 0 balls, instead of the probability that there is at least 1 bin with 0 balls. Furthermore, it's strange because it contains only one exponential, compared to our exact derivation which has a double exponential.

The mismatch in the number of exponential is a hint that another mathematical approximation is at play. We can use a first-order approximation of the binomial series to obtain

$$ (1-e^{-C})^{G/L} \approx 1 - (G/L)e^{-C} $$ which holds particularly when $|e^{-C}| \times \frac{G}{L} << 1$.

In your case, $\frac{G}{L} = 30,000$, so the approximation only really starts getting accurate when $C > 15$ or so.

Following through with our approximation, we get

$$ \epsilon \approx 1 - (1 - (G/L) e^{-C}) $$

$$ \epsilon \approx \frac{G}{L} \times e^{-C}. $$

aesthete
  • 146
  • 2