I am trying to understand the proof of the chain rule for KL divergence:
$D(p(x,y)||q(x,y)) = D(p(x)||q(x)) + D(p(y|x)||q(y|x))$
I'm a bit lost in the last step (https://www.cs.princeton.edu/courses/archive/fall11/cos597D/L03.pdf, 1.3 Conditional Divergence). What I don't understand is why this is true:
$D(p(x)||q(x)) = \sum_x \sum_y p(x,y) \log \frac{p(x)}{q(x)}$
The definition says something slightly different:
$D(p(x)||q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)}$
I have also seen this written in another way that I still don't understand (https://homes.cs.washington.edu/~anuprao/pubs/CSE533Autumn2010/lecture3.pdf, 2.3 Conditional Divergence):
$D(p(x)||q(x)) = \sum_x p(x) \log \frac{p(x)}{q(x)} \sum_y p(y|x)$
Why can that last factor $\sum_y p(y|x)$ also be absorbed into the definition of the KL divergence?
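For what it's worth, I ran a quick numerical sanity check with a made-up 2x2 joint distribution (the numbers and variable names below are just for illustration), and all three expressions do evaluate to the same value, so my question is really about why this holds in general:

```python
import math

# Toy joint distributions over x in {0,1}, y in {0,1} (made-up numbers).
p_xy = {(0, 0): 0.10, (0, 1): 0.30, (1, 0): 0.20, (1, 1): 0.40}
q_xy = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

xs, ys = [0, 1], [0, 1]

# Marginals p(x) and q(x), obtained by summing the joints over y.
p_x = {x: sum(p_xy[(x, y)] for y in ys) for x in xs}
q_x = {x: sum(q_xy[(x, y)] for y in ys) for x in xs}

# Standard definition: sum_x p(x) log(p(x)/q(x)).
d1 = sum(p_x[x] * math.log(p_x[x] / q_x[x]) for x in xs)

# Form from the Princeton notes: sum_x sum_y p(x,y) log(p(x)/q(x)).
d2 = sum(p_xy[(x, y)] * math.log(p_x[x] / q_x[x]) for x in xs for y in ys)

# Form from the UW notes: sum_x p(x) log(p(x)/q(x)) * sum_y p(y|x).
d3 = sum(
    p_x[x] * math.log(p_x[x] / q_x[x]) * sum(p_xy[(x, y)] / p_x[x] for y in ys)
    for x in xs
)

print(d1, d2, d3)  # all three print the same value
```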