Agree with the top answer that cross-entropy, as a measure of distribution dissimilarity, may be more narrowly applicable: it is used when we compare an estimated probability distribution (q) against the true probability distribution (p).
If we only consider the true probability distribution p, its entropy is the expected value of the negative log-probability (which can also be viewed as the amount of "self-information", i.e. the number of bits required to encode a sample from p):
$$H(p) = -E_{p}[\log_2(p)] = -\sum_{i=1}^{n}p_{i}\log_{2}(p_{i})$$
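As a quick illustration, a fair coin has exactly 1 bit of entropy, while a biased coin has less. A minimal NumPy sketch (the function name `entropy_bits` is mine):

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.5, 0.5]))       # 1.0 (fair coin)
print(entropy_bits([0.9, 0.1]))       # ~0.47 (a biased coin is more predictable)
```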
When computing the cross-entropy between the estimated distribution q and the true distribution p, we replace the log-probability $\log_2(p)$ with its estimated counterpart $\log_2(q)$. After re-arranging the equation we obtain the following:
\begin{equation}
\begin{aligned}
H(p, q) = -E_{p}[\log_2(q)] &= -\sum_{i=1}^{n}p_{i}\log_{2}(q_{i})\\
&= -\sum_{i=1}^{n}p_{i}\log_{2}\left(\frac{q_{i}}{p_{i}}p_{i}\right)\\
&= -\sum_{i=1}^{n}p_{i}\log_{2}(p_{i}) - \sum_{i=1}^{n}p_{i}\log_{2}\left(\frac{q_{i}}{p_{i}}\right)\\
&= H(p) + D_{\mathrm{KL}}(p\,\|\,q)
\end{aligned}
\end{equation}
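We can check this identity numerically; a small sketch (the two example distributions are arbitrary):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])   # "true" distribution
q = np.array([0.5, 0.4, 0.1])   # estimated distribution

cross_entropy = -np.sum(p * np.log2(q))
entropy_p     = -np.sum(p * np.log2(p))
kl_pq         =  np.sum(p * np.log2(p / q))   # KL(p || q)

print(cross_entropy)                                   # H(p, q)
print(entropy_p + kl_pq)                               # same value: H(p) + KL(p || q)
print(np.isclose(cross_entropy, entropy_p + kl_pq))    # True
```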
In this way, the extra KL divergence term is a relative entropy measure that describes how many extra bits we need, on average, when encoding samples from the true distribution p with a code optimized for the estimated distribution q. The asymmetry is also visible (as mentioned in other answers above):
$$-\sum_{i=1}^{n}p_{i}\log_{2}\left(\frac{q_{i}}{p_{i}}\right) \neq -\sum_{i=1}^{n}q_{i}\log_{2}\left(\frac{p_{i}}{q_{i}}\right)$$
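A quick numerical illustration of this asymmetry (again with arbitrary example distributions):

```python
import numpy as np

p = np.array([0.9, 0.1])   # e.g. a biased coin
q = np.array([0.5, 0.5])   # e.g. a fair coin

kl_pq = np.sum(p * np.log2(p / q))   # KL(p || q) ~ 0.53 bits
kl_qp = np.sum(q * np.log2(q / p))   # KL(q || p) ~ 0.74 bits
print(kl_pq, kl_qp)                  # the two directions disagree
```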
In the context of machine learning, cross-entropy is a commonly used loss function: we learn the model parameters by minimizing it. Since the entropy term $H(p)$ does not depend on the model, minimizing the cross-entropy is equivalent to minimizing the KL divergence $D_{\mathrm{KL}}(p\,\|\,q)$.
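For instance, in classification the targets are typically one-hot distributions, so $H(p) = 0$ and the cross-entropy loss coincides with the KL divergence. A minimal sketch (the function name `cross_entropy_loss` is mine; ML losses conventionally use the natural log rather than $\log_2$):

```python
import numpy as np

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy (in nats) over a batch of one-hot targets."""
    y_pred = np.clip(y_pred, eps, 1.0)   # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

y_true = np.array([[1, 0, 0], [0, 1, 0]])              # one-hot labels
y_pred = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # e.g. softmax outputs
print(cross_entropy_loss(y_true, y_pred))              # ~0.29
```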
When it comes to comparing two distributions in a broader sense, you might be looking for measures such as the following (see the usage sketch after the list):
- The asymmetric Kullback-Leibler divergence: scipy.special.kl_div
- The symmetric Jensen-Shannon divergence: scipy.spatial.distance.jensenshannon (note that this function returns the Jensen-Shannon distance, i.e. the square root of the divergence)
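A quick usage sketch of both (the example distributions are arbitrary):

```python
import numpy as np
from scipy.special import kl_div
from scipy.spatial.distance import jensenshannon

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

# kl_div is elementwise; summing gives KL(p || q) in nats, and it is asymmetric.
print(kl_div(p, q).sum(), kl_div(q, p).sum())   # two different values

# jensenshannon returns the Jensen-Shannon distance (square root of the
# divergence) and is symmetric in its arguments.
print(jensenshannon(p, q, base=2), jensenshannon(q, p, base=2))   # equal
```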