Formula for k-mer coverage

Question

Let $C$ be the base coverage, $R$ the read length, and $K$ the length of a $k$-mer. Then the $k$-mer coverage $C_k$ can be computed as $C_k = C\cdot(R - K + 1)/R$.

Could someone please explain why this equation is valid? I'm mostly confused as to why it is divided by $R$.

Source: Velvet manual Section 5.1

Kamil S Jaron · Answer 1 · 2017-07-27T10:54:57.303

I was still puzzled by the answers, so I worked through the calculation step by step. I take this definition: "$C_k$ is the number of reads containing a k-mer", and the corresponding definition for coverage ($C$): "$C$ is the number of reads covering a base".

Coverage is $C = \frac{T \cdot R}{L}$, where $T$ is the total number of reads, $R$ is the read length and $L$ is the length of the genome. Given the $C_k$ definition, $C_k = \frac{T (R - K + 1)}{L-K+1}$, where $R - K + 1$ is just the number of k-mers in a read, and $L-K+1$ is the number of k-mers in the genome. Then,

$$C_k = \frac{T (R - K + 1)}{L-K+1} = \frac{T (R - K + 1)}{L-K+1} \cdot \frac{R}{R} = \frac{R - K + 1}{R} \cdot \frac{T \cdot R}{L - K + 1}$$

Since $L \gg K$, we can approximate $L - K + 1 \approx L$, which reduces the expression to

$$\frac{R - K + 1}{R} \cdot \frac{T \cdot R}{L} = \frac{R - K + 1}{R} \cdot C$$

which is the formula for $C_k$.
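As a quick sanity check of the derivation above, here is a minimal Python sketch that compares the exact expression $T(R-K+1)/(L-K+1)$ with the approximation $C(R-K+1)/R$. The function names and all the numbers are made up for illustration; they do not come from the Velvet manual.

```python
def kmer_coverage_exact(num_reads: int, read_len: int, genome_len: int, k: int) -> float:
    """Exact form: C_k = T * (R - K + 1) / (L - K + 1)."""
    return num_reads * (read_len - k + 1) / (genome_len - k + 1)


def kmer_coverage_approx(base_cov: float, read_len: int, k: int) -> float:
    """Approximate form: C_k = C * (R - K + 1) / R, assuming L >> K."""
    return base_cov * (read_len - k + 1) / read_len


# Hypothetical values, chosen only for illustration.
T, R, L, K = 2_000_000, 100, 5_000_000, 31

C = T * R / L  # base coverage
exact = kmer_coverage_exact(T, R, L, K)
approx = kmer_coverage_approx(C, R, K)

print(f"C = {C:.1f}, exact C_k = {exact:.6f}, approx C_k = {approx:.6f}")
```

With these values the two results differ by less than 0.001, which is why dropping the $-K+1$ term in the denominator is harmless for any realistic genome length.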

score 3 · Accepted Answer · answered Jul 26 '17 at 20:45

$C_k$ is defined as the number of reads containing a k-mer. The fraction of a read available to contain a k-mer is $(R-K+1)/R$, which is the number of k-mers in the read divided by its length. Multiplying that fraction by the nucleotide coverage ($C$) gives the expected coverage of the k-mer.
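To make that concrete, here is a worked example with hypothetical values ($R = 100$, $K = 21$, $C = 50$, none of which come from the question). A read of length 100 contains $100 - 21 + 1 = 80$ k-mers, so:

$$C_k = C \cdot \frac{R - K + 1}{R} = 50 \cdot \frac{80}{100} = 40$$

Each base is covered by 50 reads on average, but only 80% of the start positions in a read leave room for a full k-mer, so each k-mer is covered by about 40 reads.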
