
Given factorizations of two joint densities $p(x_1,...,x_n)=\prod_{i=1}^n p(x_i\mid \textrm{cond}(x_i))$ and $q(x_1,...,x_n)=\prod_{i=1}^n q(x_i\mid \textrm{cond}(x_i))$, where $\textrm{cond}(\bullet)$ denotes the set of conditioning variables, does the KL-divergence decompose, i.e., does

$\textrm{KL}(p\Vert q)= \sum_{i=1}^n \textrm{KL}\left(p(x_i\mid \textrm{cond}(x_i))\ \Vert\ q(x_i\mid \textrm{cond}(x_i))\right)$

hold?
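For concreteness, here is a minimal numerical sketch of such a setup (two binary variables with the chain factorization $p(x_1,x_2)=p(x_1)\,p(x_2\mid x_1)$, and likewise for $q$; the probability tables below are made up), which computes the left-hand side from the factorized joints:

```python
# A toy instance of the question: p and q are each given by a chain
# factorization over two binary variables.  All numbers are illustrative.
import numpy as np

p_x1 = np.array([0.3, 0.7])                  # p(x1)
p_x2_given_x1 = np.array([[0.9, 0.1],        # p(x2 | x1 = 0)
                          [0.4, 0.6]])       # p(x2 | x1 = 1)

q_x1 = np.array([0.5, 0.5])                  # q(x1)
q_x2_given_x1 = np.array([[0.8, 0.2],        # q(x2 | x1 = 0)
                          [0.3, 0.7]])       # q(x2 | x1 = 1)

# Joints induced by the factorizations: p(x1, x2) = p(x1) p(x2 | x1), same for q.
p_joint = p_x1[:, None] * p_x2_given_x1
q_joint = q_x1[:, None] * q_x2_given_x1

# Left-hand side of the conjectured identity: KL(p || q) over the joint.
kl_pq = np.sum(p_joint * np.log(p_joint / q_joint))
print(kl_pq)
```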

  • If it makes things easier, you can assume $\textrm{cond}(x_i) \in \{x_1,\ldots,x_n\}$. More generally, the two factorisations are Bayesian networks, which means that $\textrm{cond}(x_i)$ can be any subset of $\{x_1,\ldots,x_n\}$ so that the induced graph structure is a directed acyclic graph. – ASML Jan 23 '16 at 14:23
  • How do you define $\textrm{KL}\left(p(x_i\mid \textrm{cond}(x_i))\ \Vert\ q(x_i\mid \textrm{cond}(x_i))\right)$? The ambiguous part is how you integrate out the $\textrm{cond}(x_i)$ variables in there. – Guillaume Dehaene Feb 03 '16 at 13:42

2 Answers


Your formula is incorrect as written: it does not say where the values of the conditioning variables come from, i.e., how they are integrated out on the right-hand side.

Assuming the set of conditioning variables $\textrm{cond}_i = \textrm{cond}(x_i)$ depends only on the index $i$ and not on the value of $x_i$, we can expand the log-ratio of the two factorized joints term by term:
$$
\begin{align*}
\textrm{KL}(p\Vert q) &= \mathbb{E}_{p(x_1, \dots, x_n)} \log \frac{p(x_1, \dots, x_n)}{q(x_1, \dots, x_n)}
= \mathbb{E}_{p(x_1, \dots, x_n)} \log \prod_{i=1}^n \frac{p(x_i\mid \textrm{cond}_i)}{q(x_i\mid \textrm{cond}_i)} \\
&= \sum_{i=1}^n \mathbb{E}_{p(x_1, \dots, x_n)} \log \left(\frac{p(x_i\mid \textrm{cond}_i)}{q(x_i\mid \textrm{cond}_i)} \right) \\
&= \sum_{i=1}^n \mathbb{E}_{p(x_i, \textrm{cond}_i)} \log \left(\frac{p(x_i\mid \textrm{cond}_i)}{q(x_i\mid \textrm{cond}_i)} \right) \\
&= \sum_{i=1}^n \mathbb{E}_{p(\textrm{cond}_i)} \mathbb{E}_{p(x_i \mid \textrm{cond}_i)} \log \left(\frac{p(x_i\mid \textrm{cond}_i)}{q(x_i\mid \textrm{cond}_i)} \right) \\
&= \sum_{i=1}^n \mathbb{E}_{p(\textrm{cond}_i)} \textrm{KL}\left(p(x_i\mid \textrm{cond}_i)\ \Vert\ q(x_i\mid \textrm{cond}_i) \right).
\end{align*}
$$
So the decomposition holds only after averaging each conditional KL divergence over $p(\textrm{cond}_i)$.
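A quick numerical check of this identity, reusing the made-up two-binary-variable chain from the question (so $\textrm{cond}_1=\varnothing$ and $\textrm{cond}_2=\{x_1\}$):

```python
# Sanity check: KL between the joints equals the sum over i of the
# p(cond_i)-weighted conditional KL divergences.  Made-up tables.
import numpy as np

p_x1 = np.array([0.3, 0.7])
p_x2_given_x1 = np.array([[0.9, 0.1], [0.4, 0.6]])   # rows indexed by x1
q_x1 = np.array([0.5, 0.5])
q_x2_given_x1 = np.array([[0.8, 0.2], [0.3, 0.7]])

def kl(p, q):
    """KL divergence between two discrete distributions given as 1-D arrays."""
    return np.sum(p * np.log(p / q))

# Left-hand side: KL(p || q) computed on the induced joints.
p_joint = p_x1[:, None] * p_x2_given_x1
q_joint = q_x1[:, None] * q_x2_given_x1
lhs = kl(p_joint.ravel(), q_joint.ravel())

# Right-hand side: KL(p(x1) || q(x1))  +  E_{p(x1)} KL(p(x2|x1) || q(x2|x1)).
rhs = kl(p_x1, q_x1) + sum(
    p_x1[a] * kl(p_x2_given_x1[a], q_x2_given_x1[a]) for a in range(2)
)

assert np.isclose(lhs, rhs)   # the two sides agree
print(lhs, rhs)
```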


As ASML suggested, you can represent the factorization of the joint probability distribution as a Bayesian network. Then, as pointed out in [Tong, S., & Koller, D. (2001)] (page 4), the KL-divergence decomposes according to the graphical structure of the network.

[Tong, S., & Koller, D. (2001)] Tong, S., & Koller, D. (2001). Active learning for parameter estimation in Bayesian networks. In Advances in Neural Information Processing Systems (pp. 647–653).

Sergio