8

Let $C$ be the base coverage, $R$ the read length and $K$ the $k$-mer length. Then the $k$-mer coverage $C_k$ can be computed as $C_k = C\cdot(R - K + 1)/R$.

Could someone please explain why this equation is valid? I'm mostly confused as to why it is divided by $R$.

Source: Velvet manual Section 5.1

Konrad Rudolph
user44697

2 Answers

5

I was still puzzled after reading the answers, so I worked through the calculation step by step. I take this definition: "$C_k$ is the number of reads containing a k-mer", and the corresponding definition for coverage: "$C$ is the number of reads covering a base".

Coverage is $C = \frac{T \cdot R}{L}$, where $T$ is the total number of reads, $R$ is the read length and $L$ is the length of the genome. Given the $C_k$ definition, $C_k = \frac{T (R - K + 1)}{L-K+1}$, where $R - K + 1$ is the number of k-mers in a read and $L-K+1$ is the number of k-mers in the genome. Then,

$$C_k = \frac{T (R - K + 1)}{L-K+1} = \frac{T (R - K + 1)}{L-K+1} \cdot \frac{R}{R} = \frac{R - K + 1}{R} \cdot \frac{T \cdot R}{L - K + 1}$$

Since $L \gg K$, we can approximate $L - K + 1 \approx L$, which reduces the expression to

$$\frac{R - K + 1}{R} \cdot \frac{T \cdot R}{L} = \frac{R - K + 1}{R} \cdot C$$

which is the formula for $C_k$.
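As a quick numerical check of the derivation above, here is a minimal Python sketch comparing the exact expression $\frac{T (R - K + 1)}{L-K+1}$ with the approximation $C \cdot (R - K + 1)/R$. The function names and the example values (300,000 reads of length 100, a 1 Mb genome, $K = 31$) are my own illustrative assumptions, not taken from the Velvet manual.

```python
def kmer_coverage_approx(base_cov, read_len, k):
    """Approximate k-mer coverage: C_k = C * (R - K + 1) / R."""
    return base_cov * (read_len - k + 1) / read_len


def kmer_coverage_exact(n_reads, read_len, genome_len, k):
    """k-mer coverage before the L - K + 1 ~ L approximation:
    total k-mers in all reads divided by k-mers in the genome."""
    return n_reads * (read_len - k + 1) / (genome_len - k + 1)


# Hypothetical example values, chosen only for illustration.
n_reads, read_len, genome_len, k = 300_000, 100, 1_000_000, 31

base_cov = n_reads * read_len / genome_len  # C = T * R / L
approx = kmer_coverage_approx(base_cov, read_len, k)
exact = kmer_coverage_exact(n_reads, read_len, genome_len, k)

print(f"C = {base_cov}, C_k (approx) = {approx}, C_k (exact) = {exact:.5f}")
```

With these values the two results differ by less than 0.001, which shows how little the $L - K + 1 \approx L$ step matters when the genome is much longer than $K$.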

Kamil S Jaron
3

$C_k$ is defined as the number of reads containing a given k-mer. A read of length $R$ contains $R-K+1$ k-mers, so the fraction of a read's positions at which a k-mer can start is $(R-K+1)/R$. Multiplying that fraction by the nucleotide coverage ($C$) gives the expected coverage of the k-mer.
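To make the counting argument concrete, here is a small Python sketch under an idealized assumption of my own: exactly one read starts at every possible position of a linear genome. The helper names and the numbers (genome of 1000 bases, $R = 100$, $K = 31$, position 500) are hypothetical. It counts the reads covering a single base and the reads fully containing the k-mer that starts at that base.

```python
def reads_covering_base(starts, read_len, pos):
    """Number of reads whose span [s, s + read_len) includes position pos."""
    return sum(1 for s in starts if s <= pos < s + read_len)


def reads_containing_kmer(starts, read_len, k, pos):
    """Number of reads that fully contain the k-mer occupying [pos, pos + k)."""
    return sum(1 for s in starts if s <= pos and pos + k <= s + read_len)


# Idealized, hypothetical setup: one read starting at every valid position.
genome_len, read_len, k = 1000, 100, 31
starts = range(genome_len - read_len + 1)
pos = 500  # an interior position, far from the genome ends

base_cov = reads_covering_base(starts, read_len, pos)
kmer_cov = reads_containing_kmer(starts, read_len, k, pos)

print(base_cov, kmer_cov, kmer_cov / base_cov)
```

A read has to start within the last $R$ positions up to `pos` to cover the base, but only $R - K + 1$ of those start positions leave enough room for the whole k-mer, so the ratio of the two counts is $(R-K+1)/R$.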

Devon Ryan