I am not a mathematician. I have searched the internet about the KL divergence. What I learned is that the KL divergence measures the information lost when we approximate a model's distribution with respect to the input distribution. I have seen it computed between any two continuous distributions or any two discrete distributions. Can we compute it between a continuous and a discrete distribution, or vice versa?
Related: http://stats.stackexchange.com/q/6907/2970 – cardinal Sep 04 '13 at 01:28
3 Answers
Yes, the KL divergence between continuous and discrete random variables is well defined. If $P$ and $Q$ are distributions on some space $\mathbb{X}$, then both $P$ and $Q$ have densities $f$, $g$ with respect to $\mu = P+Q$ and $$ D_{KL}(P,Q) = \int_{\mathbb{X}} f \log\frac{f}{g}d\mu. $$
For example, if $\mathbb{X} = [0,1]$, $P$ is Lebesgue's measure and $Q = \delta_0$ is a point mass at $0$, then $f(x) = 1-\mathbb{1}_{x=0}$, $g(x) = \mathbb{1}_{x=0}$ and $$D_{KL}(P, Q) = \infty.$$
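To spell out where the infinity comes from: under $\mu = P + Q = \lambda + \delta_0$ (Lebesgue measure plus the point mass), the atom at $0$ contributes $f(0)\log\frac{f(0)}{g(0)} = 0\log\frac{0}{1} = 0$ to the integral, while for every $x \in (0,1]$ the integrand is $$f(x)\log\frac{f(x)}{g(x)} = 1 \cdot \log\frac{1}{0} = +\infty,$$ and this happens on a set of Lebesgue measure $1$, so the integral diverges.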
How do you prove that $\int_{\mathbb{X}} f \log\frac{f}{g}d\mu$ is independent of the dominating measure? – Gabriel Romon Feb 01 '19 at 11:36
@Olivier with that $f$ you get $P([0,0.5]) = -0.5$ but probability measures must be non-negative. Try maybe a convex sum between the two. – Jorge E. Cardona Jun 30 '20 at 20:12
Ok, I get it now. Is there a case when $D_{KL}(P,Q)$ is finite? @Olivier – Jorge E. Cardona Jul 01 '20 at 16:09
Much later: @JorgeE.Cardona, I believe the answer to your question is no; see my just substantially-updated answer. – Danica Apr 20 '23 at 06:58
KL divergence is only defined on distributions over a common space. If $p$ is a distribution on $\mathbb{R}^3$ and $q$ a distribution on $\mathbb{Z}$, then $q(x)$ doesn't make sense for points $x \in \mathbb{R}^3$ and $p(z)$ doesn't make sense for points $z \in \mathbb{Z}$.
However, if you have a discrete distribution over the same space as a continuous distribution, e.g. both on $\mathbb R$ (although the discrete distribution obviously doesn't have support on all of $\mathbb R$), the KL divergence can be defined, as in Olivier's answer.
To do this, we have to use densities with respect to a common "dominating measure" $\mu$: if $\frac{\mathrm d P}{\mathrm d \mu} = p$ and $\frac{\mathrm d Q}{\mathrm d \mu} = q$, then $\operatorname{KL}(P \,\|\, Q) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm d\mu(x)$. These densities are called Radon-Nikodym derivatives, and $\mu$ should dominate the distributions $P$ and $Q$.
This is always possible, e.g. by using $\mu = P + Q$ as Olivier did. Also, as Olivier points out in the comments, the value of the KL divergence is invariant to the choice of dominating measure (a proof sketch follows), so the choice of $\mu$ doesn't matter.
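Here is the standard argument for that invariance, in sketch form. Suppose $\mu$ and $\nu$ both dominate $P$ and $Q$, and let $\rho = \mu + \nu$, so that $\mu \ll \rho$. The chain rule for Radon-Nikodym derivatives gives $\frac{\mathrm d P}{\mathrm d \rho} = \frac{\mathrm d P}{\mathrm d \mu} \frac{\mathrm d \mu}{\mathrm d \rho}$ $\rho$-almost everywhere (likewise for $Q$), and the set where $\frac{\mathrm d \mu}{\mathrm d \rho} = 0$ is $\mu$-null and hence $P$-null. The factor $\frac{\mathrm d \mu}{\mathrm d \rho}$ therefore cancels inside the logarithm, and $$\int \frac{\mathrm d P}{\mathrm d \rho} \log \frac{\mathrm d P / \mathrm d \rho}{\mathrm d Q / \mathrm d \rho} \mathrm d \rho = \int \frac{\mathrm d P}{\mathrm d \mu} \log \frac{\mathrm d P / \mathrm d \mu}{\mathrm d Q / \mathrm d \mu} \frac{\mathrm d \mu}{\mathrm d \rho} \mathrm d \rho = \int \frac{\mathrm d P}{\mathrm d \mu} \log \frac{\mathrm d P / \mathrm d \mu}{\mathrm d Q / \mathrm d \mu} \mathrm d \mu,$$ so the KL computed with $\mu$ agrees with the KL computed with $\rho$; by symmetry the same holds for $\nu$, and the two original choices agree.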
Then, if one distribution is continuous and the other discrete, both directions of the KL are always infinite (shown below). The behaviour for other $f$-divergences is essentially the same; see here.
This makes it not a very interesting measure. The same is true, I believe, for any $f$-divergence (not always infinite, but always a particular constant value). A much more interesting class of measures is the integral probability metrics, $$D(P, Q) = \sup_{f \in \mathcal F} \mathbb E_{X \sim P} f(X) - \mathbb E_{Y \sim Q} f(Y),$$ particularly the Wasserstein distance and the kernel MMD, both of which can meaningfully compare distributions with different supports.
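As a quick illustration of that last point, here is a minimal sketch (assuming NumPy and SciPy are available; the particular distributions and sample size are arbitrary choices of mine). The 1-D Wasserstein distance returns a finite number for a continuous/discrete pair even though both directions of the KL divergence are infinite:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Draw samples from a continuous distribution, P = N(0, 1).
x = rng.normal(size=10_000)

# A discrete distribution Q on three atoms, given as values plus weights.
support = np.array([-1.0, 0.0, 1.0])
weights = np.array([0.25, 0.5, 0.25])

# Finite (roughly 0.36 here), even though KL(P, Q) = KL(Q, P) = infinity.
print(wasserstein_distance(x, support, v_weights=weights))
```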
To see the claim that both directions of the KL are infinite: I find it kind of confusing to take derivatives with respect to $P + Q$, so let's do something I find simpler.
- If $P$ is a continuous distribution, then typically what we mean by that is that it's dominated by Lebesgue measure $\lambda$, and has density $p$ with respect to $\lambda$, where $p$ is exactly the typical probability density function.
- If $Q$ is a discrete distribution, its support is some at-most-countable set; call it $K$, and define the measure $k(A) = \lvert A \cap K \rvert$. Then $Q$ has density $q$ with respect to $k$, where $q$ is exactly the typical probability mass function.
- So, let's use as our base measure $\mu = \lambda + k$. (Note that indeed $\mu$ is $\sigma$-finite.)
Now, $\frac{\mathrm d P}{\mathrm d \mu}$ is a function $\tilde p$ such that $P(A) = \int_A \tilde p(x) \mathrm d \mu(x) = \int_A \tilde p(x) \mathrm d \lambda(x) + \sum_{x \in A \cap K} \tilde p(x)$. To make this work, we should choose $$\tilde p(x) = \begin{cases} 0 & \text{if } x \in K \\ p(x) & \text{otherwise} \end{cases}.$$ Then $\sum_{x \in A \cap K} \tilde p(x) = 0$, and $$\int_A \tilde p(x) \mathrm d \lambda(x) = \int_A p(x) \mathrm d \lambda(x) - \int_{A \cap K} p(x) \mathrm d \lambda(x) = P(A) - 0,$$ since $\lambda(K) = 0$. Thus $\tilde p$ is a valid derivative.
The $Q$ derivative is similar: pick $\tilde q(x) = \begin{cases} q(x) & \text{if } x \in K \\ 0 & \text{otherwise} \end{cases}$.
Armed with these derivatives, we can now evaluate the KL divergence.
For one direction, we have \begin{align*} \operatorname{KL}(P \,\|\, Q) &= \int \tilde p(x) \log \frac{\tilde p(x)}{\tilde q(x)} \mathrm d\mu(x) \\&= \int \tilde p(x) \log \frac{\tilde p(x)}{\tilde q(x)} \mathrm d\lambda(x) + \sum_{x \in K} \tilde p(x) \log \frac{\tilde p(x)}{\tilde q(x)} .\end{align*} In the last line, each term inside the sum is of the form $0 \log \frac{0}{\text{nonzero}}$. This is unfortunate, but the usual convention in defining the KL divergence (and entropy) treats that as zero, so the sum is zero. Meanwhile, the integrand is $\tilde p(x) \log \frac{\tilde p(x)}{0}$, with $\tilde p(x)$ nonzero everywhere on $\operatorname{supp}(P) \setminus K$. That makes the integral, and hence $\operatorname{KL}(P \,\|\, Q)$, infinite.
The other direction is much the same; we get $$ \operatorname{KL}(Q \,\|\, P) = \int \tilde q(x) \log \frac{\tilde q(x)}{\tilde p(x)} \mathrm d\lambda(x) + \sum_{x \in K} \tilde q(x) \log \frac{\tilde q(x)}{\tilde p(x)} ,$$ and now the integrand is $0 \log \frac{0}{\tilde p(x)}$ ($\lambda$-almost everywhere), which the same convention treats as zero, while the sum is $\sum_{x \in K} q(x) \log \frac{q(x)}{0}$, making this direction of KL also infinite.
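To make this concrete, take $P = \operatorname{Uniform}[0,1]$ and $Q = \delta_{1/2}$, so that $K = \{1/2\}$, $\tilde p = \mathbb{1}_{[0,1] \setminus K}$, and $\tilde q = \mathbb{1}_K$. Then $$\operatorname{KL}(P \,\|\, Q) = \int_{[0,1] \setminus K} 1 \cdot \log\frac{1}{0} \,\mathrm d\lambda(x) = \infty \qquad\text{and}\qquad \operatorname{KL}(Q \,\|\, P) = 1 \cdot \log\frac{1}{0} = \infty,$$ exactly as the general argument predicts.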
Note that the KL divergence between discrete and absolutely continuous distributions is well defined. – Olivier Jun 06 '17 at 11:37
@Olivier The usual definition requires a common dominating measure, no? – Danica Jun 06 '17 at 12:01
You are right when P and Q are defined on different spaces. But on a common measurable space, such a measure always exists (take P+Q for instance), and the KL divergence does not depend on the particular choice of dominating measure. – Olivier Jun 06 '17 at 12:39
(Just a note that the comments above applied to a ten-year-old answer that was basically just the current first paragraph; they're no longer relevant to the current answer, which has accounted for this.) – Danica Apr 20 '23 at 17:40
Not in general. The KL divergence is
$$ D_{KL}(P \ || \ Q) = \int_{\mathcal{X}} \log \left(\frac{dP}{dQ}\right)dP $$
provided that $P$ is absolutely continuous with respect to $Q$ and both $P$ and $Q$ are $\sigma$-finite (i.e. under conditions where $\frac{dP}{dQ}$ is well-defined).
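When those conditions do hold, this abstract formula reduces to the familiar ones. For instance, if $P$ and $Q$ are both discrete with probability mass functions $p$ and $q$ on a common countable support, and $q(x) > 0$ wherever $p(x) > 0$, then $\frac{dP}{dQ}(x) = \frac{p(x)}{q(x)}$ and $$ D_{KL}(P \ || \ Q) = \sum_x p(x) \log \frac{p(x)}{q(x)}. $$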
For a 'continuous-to-discrete' KL divergence between measures on a usual space like $\mathbb{R}$, you have the case where Lebesgue measure is absolutely continuous with respect to counting measure (the only counting-measure-null set is the empty set), but counting measure is not $\sigma$-finite, so the Radon-Nikodym derivative need not exist.
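And for the probability measures themselves, absolute continuity fails outright: if $P$ is uniform on $[0,1]$ and $Q = \delta_0$, then $Q((0,1]) = 0$ while $P((0,1]) = 1$, so $P$ is not absolutely continuous with respect to $Q$, $\frac{dP}{dQ}$ does not exist, and $D_{KL}(P \ || \ Q)$ is taken to be $\infty$ by convention.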