If the full distribution $P(X, Y)$ is not computationally tractable, then we may choose to work with a simpler distribution $P(X)P(Y)$. In this case, $\text{KLD}(P(X, Y) || P(X)P(Y))$ will tell us how well the factored distribution approximates the full distribution we are actually interested in. If we are able to work with the full distribution of interest, there's usually no reason to see how well the full version approximates some simpler factored version.
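As a concrete illustration of this direction, here is a minimal sketch (the joint table and its numbers are made up for illustration) that computes $\text{KLD}(P(X, Y) || P(X)P(Y))$ for a small discrete joint; the divergence is zero exactly when $X$ and $Y$ are already independent, and grows as the factored approximation throws away more of the dependence.

```python
# A minimal sketch with a hypothetical 2x3 discrete joint P(X, Y), showing how
# KLD(P(X, Y) || P(X) P(Y)) measures what is lost by replacing the joint with
# the product of its marginals.
import numpy as np

# Hypothetical joint distribution over X (rows) and Y (columns); entries sum to 1.
P_xy = np.array([[0.20, 0.10, 0.05],
                 [0.05, 0.25, 0.35]])

P_x = P_xy.sum(axis=1, keepdims=True)   # marginal P(X)
P_y = P_xy.sum(axis=0, keepdims=True)   # marginal P(Y)
P_factored = P_x * P_y                  # factored approximation P(X) P(Y)

# KLD(P(X, Y) || P(X) P(Y)) = sum_{x,y} P(x, y) log( P(x, y) / (P(x) P(y)) )
kld = np.sum(P_xy * np.log(P_xy / P_factored))
print(f"KLD(P(X, Y) || P(X) P(Y)) = {kld:.4f} nats")
```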
However, the reverse direction does come up in variational Bayesian inference, where we want to learn a distribution over some number of unobserved variables, call them $Y$ and $Z$, while only observing $X$. Variational Bayes is used when the posterior of interest $P(Y, Z | X)$ is computationally intractable. (Typically it's intractable because there are many unobserved variables, rather than just the two $Y$ and $Z$.) To formulate a tractable optimization problem, variational Bayes introduces an approximating distribution $Q(Y, Z)$ over the unobserved variables and uses it to define a lower bound on the log marginal probability of the observed data:
\begin{align}
\log P(x) & = \log \sum_{y, z} P(x, y, z) \\
& = \log \sum_{y, z} Q(y, z) \frac{P(x, y, z)}{Q(y, z)}\\
& \geq \sum_{y, z} Q(y, z) \log \left(\frac{P(x, y, z)}{Q(y, z)}\right) \\
& = E_Q\left[\log P(x, y, z)\right] - E_Q\left[\log Q(y, z)\right]
\end{align}
where the inequality follows from Jensen's inequality (since $\log$ is concave, the log of an expectation is at least the expectation of the log). We can then learn $Q$ by maximizing this lower bound, often called the evidence lower bound (ELBO), on the log marginal:
\begin{align}
Q^*
& = \underset{Q}{\text{argmax}}~ E_Q\left[\log P(x, y, z)\right] - E_Q\left[\log Q(y, z)\right]
\end{align}
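To make this objective concrete, here is a small numerical sketch, assuming a made-up joint $P(x, y, z)$ over three binary variables and a single observed value of $x$. It evaluates the lower bound $E_Q[\log P(x, y, z)] - E_Q[\log Q(y, z)]$ for many candidate distributions $Q(y, z)$: no candidate exceeds $\log P(x)$, and the bound is tight when $Q$ equals the exact posterior $P(y, z | x)$.

```python
# A small numerical sketch of the maximization above, with a made-up joint
# P(x, y, z) over three binary variables and a fixed observation x.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical joint P(x, y, z), shape (2, 2, 2), entries sum to 1.
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)

x_obs = 0
log_P_x = np.log(P[x_obs].sum())     # log marginal probability of the observation

def elbo(Q):
    """Lower bound E_Q[log P(x_obs, y, z)] - E_Q[log Q(y, z)] for a 2x2 Q summing to 1."""
    return np.sum(Q * (np.log(P[x_obs]) - np.log(Q)))

# Random search over candidate Q(y, z); no candidate should exceed log P(x).
best = max(elbo(rng.dirichlet(np.ones(4)).reshape(2, 2)) for _ in range(5000))
exact_posterior = P[x_obs] / P[x_obs].sum()

print(f"log P(x)               = {log_P_x:.4f}")
print(f"best ELBO from search   = {best:.4f}")
print(f"ELBO at exact posterior = {elbo(exact_posterior):.4f}")  # equals log P(x)
```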
We can see that this is equivalent to minimizing $\text{KLD}(Q || P)$: negate to turn the maximization into a minimization, add $\log P(x)$ (which doesn't affect the optimum since $\log P(x)$ doesn't depend on $Q$), and use $\log P(x, y, z) = \log P(y, z | x) + \log P(x)$:
\begin{align}
Q^* & = \underset{Q}{\text{argmin}}~ -\left(E_Q\left[\log P(x, y, z)\right] - E_Q\left[\log Q(y, z)\right]\right)\\
& = \underset{Q}{\text{argmin}}~
E_Q\left[\log Q(y, z)\right] - E_Q\left[\log P(x, y, z)\right] + \log P(x) \\
& = \underset{Q}{\text{argmin}}~
E_Q\left[\log Q(y, z)\right] - E_Q\left[\log P(y, z | x)\right]\\
& = \underset{Q}{\text{argmin}}~ \text{KLD}(Q(y, z) || P(y, z | x))
\end{align}
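This equivalence can also be checked numerically. The sketch below, again with a made-up binary model, verifies the identity $E_Q[\log P(x, y, z)] - E_Q[\log Q(y, z)] = \log P(x) - \text{KLD}(Q(y, z) || P(y, z | x))$ for arbitrary $Q$, which is why maximizing the lower bound and minimizing the KL divergence pick out the same $Q$.

```python
# A sketch verifying the identity behind the derivation above: for any Q(y, z),
#   ELBO(Q) = log P(x) - KLD(Q(y, z) || P(y, z | x)),
# so maximizing the ELBO is exactly minimizing the KL divergence to the posterior.
import numpy as np

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # hypothetical joint P(x, y, z)
x_obs = 0
posterior = P[x_obs] / P[x_obs].sum()            # exact posterior P(y, z | x)
log_P_x = np.log(P[x_obs].sum())

for _ in range(3):
    Q = rng.dirichlet(np.ones(4)).reshape(2, 2)  # a random candidate Q(y, z)
    elbo = np.sum(Q * (np.log(P[x_obs]) - np.log(Q)))
    kld = np.sum(Q * (np.log(Q) - np.log(posterior)))
    # Both sides agree up to floating-point error.
    print(f"ELBO + KLD = {elbo + kld:.6f},  log P(x) = {log_P_x:.6f}")
```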
To make the learning problem tractable, $Q$ is typically taken from a constrained family of distributions. A popular choice is the "mean-field" approximation, which severs all dependencies between the unobserved variables: $Q(y, z) = Q(y)Q(z)$. In this case, we are finding $Q(y)$ and $Q(z)$ to minimize $\text{KLD}(Q(y)Q(z) || P(y, z | x))$.
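For discrete variables, this mean-field problem can be solved by coordinate ascent, iterating the standard updates $Q(y) \propto \exp\left(E_{Q(z)}[\log P(x, y, z)]\right)$ and $Q(z) \propto \exp\left(E_{Q(y)}[\log P(x, y, z)]\right)$. The sketch below (again with a made-up binary joint) runs these updates and reports the resulting $\text{KLD}(Q(y)Q(z) || P(y, z | x))$.

```python
# A minimal mean-field (coordinate ascent) sketch for a made-up binary model,
# restricting Q to the factored form Q(y, z) = Q(y) Q(z).
import numpy as np

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(8)).reshape(2, 2, 2)   # hypothetical joint P(x, y, z)
x_obs = 0
log_Pxyz = np.log(P[x_obs])                      # log P(x_obs, y, z), shape (2, 2)

q_y = np.array([0.5, 0.5])                       # initial factor Q(y)
q_z = np.array([0.5, 0.5])                       # initial factor Q(z)

for _ in range(50):                              # coordinate ascent on the lower bound
    q_y = np.exp(log_Pxyz @ q_z); q_y /= q_y.sum()   # Q(y) ∝ exp(E_{Q(z)}[log P(x, y, z)])
    q_z = np.exp(q_y @ log_Pxyz); q_z /= q_z.sum()   # Q(z) ∝ exp(E_{Q(y)}[log P(x, y, z)])

Q = np.outer(q_y, q_z)                           # factored approximation Q(y) Q(z)
posterior = P[x_obs] / P[x_obs].sum()
kld = np.sum(Q * (np.log(Q) - np.log(posterior)))
print(f"KLD(Q(y) Q(z) || P(y, z | x)) after mean-field updates = {kld:.4f}")
```

Because the family is restricted, the resulting divergence is typically positive: the factored $Q$ cannot represent any remaining dependence between $y$ and $z$ in the true posterior.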