
I am currently reading 'Dive into Deep Learning' and right now I am trying to improve my intuition for the Kullback–Leibler divergence. I get the basic idea and why this measure is not symmetric, but I do not understand the massive divergence in this example, taken from the book:

"First, let’s generate and sort three tensors of length 1000: an objective tensor $p$, which follows a normal distribution $N(0,1)$ , and two candidate tensors $q_1$ and $q_2$ which follow normal distributions $N(-1,1)$ and $N(1,1)$ respectively."

import torch

tensor_len = 10000
p = torch.normal(0, 1, (tensor_len,))
q1 = torch.normal(-1, 1, (tensor_len,))
q2 = torch.normal(1, 1, (tensor_len,))

They then compare $D_{KL}(q_2 || p)$ and $D_{KL}(p || q_2)$:

kl_pq2 = kl_divergence(p, q2)   # kl_divergence is the helper function defined earlier in the book
kl_q2p = kl_divergence(q2, p)

kl_pq2, kl_q2p
(8582.0341796875, 14130.125)

This difference does not really make sense to me; the two distributions look very similar to me. When $q_2(x)$ is likely, $p(x)$ is not, and vice versa. Regardless of which distribution is the observed one, the similarities in the pdfs are the same. This is probably poorly phrased, but since I am quite new to this topic and this is about intuition, it is the best I can do.

I also did my own calculations:

import numpy as np
from scipy.stats import norm, entropy

gaussian = np.random.normal(loc=0.0, scale=1.0, size=1000)
gaussian_2 = np.random.normal(loc=1, scale=1.0, size=1000)

pdf1 = norm.pdf(gaussian, loc=0, scale=1)
pdf2 = norm.pdf(gaussian_2, loc=1, scale=1)

print(f'entropy p||q: {entropy(pdf1, pdf2)}')
print(f'entropy q||p: {entropy(pdf2, pdf1)}')

entropy p||q: 0.25600232157755665
entropy q||p: 0.2552700353936643

This made more sense to me but confused me even more. Can anyone explain to me where this discrepancy comes from and why I get different results?

kklaw

1 Answer


Note that the Kullback-Leibler divergence, which as you rightly point out is not symmetric, is

$$
\begin{align*}
\text{KL}(P\vert\vert Q) & = \int p(x)\log\left(\frac{p(x)}{q(x)}\right)\,\text{d}x = \int p(x)\log\left(\frac{1}{q(x)}\right)\,\text{d}x - \int p(x)\log\left(\frac{1}{p(x)}\right)\,\text{d}x\\
& = H(P,Q) - H(P),
\end{align*}
$$

where $H(P,Q)$ is the cross-entropy of $P$ and $Q$, and $H(P)$ is the entropy of $P$, or the cross-entropy of $P$ with itself. I'm not familiar with Python, but I guess that entropy in your code computes $H(P, Q)$.

As you can see from its definition, $H(P, Q)$ is generally different from $H(Q, P)$, which is another way to see why in general $\text{KL}(P||Q)\neq\text{KL}(Q||P)$.
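To make the identity concrete, here is a small numerical check on two made-up discrete distributions (a minimal sketch; the specific probabilities are chosen only for illustration):

import numpy as np

# Two arbitrary discrete distributions on the same support (values made up for illustration)
p = np.array([0.9, 0.05, 0.05])
q = np.array([0.6, 0.3, 0.1])

kl_pq         = np.sum(p * np.log(p / q))   # KL(P || Q)
cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
entropy_p     = -np.sum(p * np.log(p))      # H(P)

print(kl_pq, cross_entropy - entropy_p)     # both print ~0.2407
print(np.sum(q * np.log(q / p)))            # KL(Q || P) is ~0.3636, a different number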

Now, the KL for two normal models $p(x) = N(\mu_1,\sigma_1^2)$ and $q(x) = N(\mu_2,\sigma_2^2)$ is

$$ \text{KL}(p||q) = \log \frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2 \sigma_2^2} - \frac{1}{2}. \quad(*) $$
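A direct implementation of (*) makes it easy to experiment with this formula and to see how the two directions come apart as soon as the variances differ (a minimal sketch; gaussian_kl is just an illustrative name):

import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    # formula (*): KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

print(gaussian_kl(0, 1, 1, 2))   # ~0.443
print(gaussian_kl(1, 2, 0, 1))   # ~1.307: with unequal variances the two directions differ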

Although there is a great deal of overlap between $N(0, 1)$ and $N(1, 1)$, the two densities are different (see the figure below), so we expect both versions of the KL to be nonzero. The question, then, is whether the two versions of the KL are equal or not.

Using (*) in this particular case, as you almost correctly conjectured, it turns out that the two KLs are equal, i.e.

$$\text{KL}(N(0,1)|| N(1,1)) = \text{KL}(N(1,1)|| N(0,1)) = 0.5$$
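You can confirm this without plugging into (*) by asking PyTorch for the analytic KL between the two distribution objects (a minimal sketch; note that this kl_divergence is the one from torch.distributions, not the helper defined in the book):

import torch
from torch.distributions import Normal
from torch.distributions.kl import kl_divergence

p = Normal(0.0, 1.0)
q2 = Normal(1.0, 1.0)

print(kl_divergence(p, q2))   # tensor(0.5000)
print(kl_divergence(q2, p))   # tensor(0.5000)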

[Figure: densities of $N(0,1)$ and $N(1,1)$, which overlap substantially but are clearly different]

utobi
  • I agree that they are different, but to me they still seem symmetric. To me, the ratio $\log(\frac{p(x)}{q(x)})$ is equal to $\log(\frac{q(x)}{p(x)})$, because I can find that same ratio somewhere on the difference of both distributions. Let's take some $pdf(x)$ from our observed distribution $p(x)$ and compare it against $q(x)$. I will find that same ratio if I take $pdf(x)$ from $q(x)$ and compare it to $p(x)$. – kklaw Oct 30 '22 at 20:13
  • When $p=N(0,1)$, $q=N(1,1)$ that equality is reached only at $x=0$. – utobi Oct 30 '22 at 20:22
  • Indeed, but take for example $x=-2$. The difference $p(-2) - q(-2)$ is found again at approx. $x = 3$, or $x = 2 + \sigma$. And to me, I can apply that same logic to every point. – kklaw Oct 30 '22 at 20:25
  • I'm not sure I follow your point; $\log\left(\frac{p(x)}{q(x)}\right) = -\log\left(\frac{q(x)}{p(x)}\right)$, so they are different things. Besides this, when you flip the order of the distributions in the KL, you integrate the ratio w.r.t. a different measure. – utobi Oct 30 '22 at 20:37
  • @utobi: In your last comment there must be something missing, probably simply a / – kjetil b halvorsen Oct 31 '22 at 01:24
  • Another thing: the code I ran with Python definitely calculates the Kullback–Leibler divergence. scipy.stats.entropy(pk, qk=None, base=None, axis=0), which is the method I used, says in its documentation: "If qk is not None, then compute the Kullback-Leibler divergence", and qk is not None in my code. – kklaw Oct 31 '22 at 08:08
  • One more thing: I found this thread: https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians. Looking at the result, I feel like only the variance is responsible for differences in the divergence. But the variance is the same in our case, so shouldn't this thread support my claim? – kklaw Nov 01 '22 at 18:28
  • @kklaw Yes. In this particular case, both KLs are equal to 0.5; see my updated answer. – utobi Nov 02 '22 at 20:01
  • @kklaw oh yes, you are right. – utobi Nov 02 '22 at 20:54