This is how I think about it:
$$
D_{KL}(p(y_i | x_i) \:||\: q(y_i | x_i, \theta)) = H(p(y_i | x_i), q(y_i | x_i, \theta)) - H(p(y_i | x_i)) \tag{1}\label{eq:kl}
$$
where $p$ and $q$ are two probability distributions. In machine learning, we typically know $p$, the distribution of the target. For example, in a binary classification problem $\mathcal{Y} = \{0, 1\}$, so if $y_i = 1$ then $p(y_i = 1 | x_i) = 1$ and $p(y_i = 0 | x_i) = 0$, and vice versa.

Given the target $y_i$ for each $i = 1, 2, \ldots, N$, where $N$ is the total number of points in the dataset, we typically want to minimize the KL divergence $D_{KL}(p, q)$ between the distribution of the target $p(y_i | x_i)$ and our predicted distribution $q(y_i | x_i, \theta)$, averaged over all $i$. (We do so by tuning our model parameters $\theta$; for each training example, the model spits out a distribution over the class labels $0$ and $1$.) Since the target for each example is fixed, its distribution never changes: $H(p(y_i | x_i))$ is a constant for each $i$, regardless of the current model parameters $\theta$. Therefore the minimizer of $D_{KL}(p, q)$ is the same as the minimizer of $H(p, q)$.
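To make this concrete, here is a minimal numpy sketch (the helper functions, the `eps` smoothing, and the particular $q$ values are just illustrative choices on my part): with a one-hot target $p$, $H(p) = 0$, so the KL divergence and the cross-entropy are literally the same number, and in general they differ only by the constant $H(p)$.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(p, q) = -sum_k p_k log q_k."""
    return -np.sum(p * np.log(q + eps))

def entropy(p, eps=1e-12):
    """H(p) = -sum_k p_k log p_k."""
    return -np.sum(p * np.log(p + eps))

def kl(p, q):
    """D_KL(p || q) = H(p, q) - H(p), as in equation (1)."""
    return cross_entropy(p, q) - entropy(p)

# Single training example with y_i = 1 in a binary problem:
p = np.array([0.0, 1.0])   # one-hot target: p(y=0|x_i) = 0, p(y=1|x_i) = 1
q = np.array([0.3, 0.7])   # model's predicted distribution q(y|x_i, theta)

print(entropy(p))          # ~0 (exactly 0 up to the eps smoothing) -> H(p) is constant
print(cross_entropy(p, q)) # ~0.3567 = -log 0.7
print(kl(p, q))            # ~0.3567 -> same value, hence the same minimizer
```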
If you had a situation where $p$ and $q$ were both variable (say, two latent variables $x_1 \sim p$ and $x_2 \sim q$) and you wanted to match the two distributions, then you would have to choose between minimizing $D_{KL}(p, q)$ and minimizing $H(p, q)$. This is because, all else being equal, minimizing $D_{KL}(p, q)$ rewards making $H(p)$ large, while minimizing $H(p, q)$ rewards making $H(p)$ small. To see the latter, solve equation (\ref{eq:kl}) for $H(p,q)$:
$$
H(p,q) = D_{KL}(p,q) + H(p) \tag{2}\label{eq:hpq}
$$
The former would yield a broad distribution for $p$, while the latter would yield one that is concentrated in one or a few modes (the sketch below illustrates this on a toy Bernoulli example). Note also that it is your choice as an ML practitioner whether to minimize $D_{KL}(p, q)$ or $D_{KL}(q, p)$; a short discussion of this in the context of variational inference (VI) follows.
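Here is a small numeric sketch of that tension (the fixed $q = \mathrm{Bernoulli}(0.7)$, the grid of Bernoulli parameters for $p$, and the helper function are all illustrative assumptions, not anything from the argument above): minimizing $H(p, q)$ over $p$ drives $p$ toward a point mass, while minimizing $D_{KL}(p, q)$ drives $p$ toward $q$ itself, leaving its entropy alone.

```python
import numpy as np

q = np.array([0.3, 0.7])                 # fixed reference distribution
ts = np.linspace(0.01, 0.99, 99)         # candidate Bernoulli parameters for p
ps = np.stack([1 - ts, ts], axis=-1)     # each row is a candidate p = [p(y=0), p(y=1)]

def H(a, b, eps=1e-12):
    """Row-wise cross-entropy H(a, b); H(a, a) is the entropy of a."""
    return -np.sum(a * np.log(b + eps), axis=-1)

ce = H(ps, q)            # H(p, q) for every candidate p
kl = ce - H(ps, ps)      # D_KL(p || q) = H(p, q) - H(p), equation (1)

print(ts[np.argmin(ce)]) # ~0.99 -> minimizing H(p, q) collapses p toward a point mass
print(ts[np.argmin(kl)]) # ~0.70 -> minimizing D_KL(p, q) matches p to q, keeping H(p)
```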
In VI, you must choose between minimizing $D_{KL}(p, q)$ and $D_{KL}(q, p)$, which are not equal since KL divergence is not symmetric. If we once again treat $p$ as known, then minimizing $D_{KL}(p, q)$ (the "forward" KL) results in a $q$ that is wide and covers a broad range of the domain of $p$, because $q$ is heavily penalized wherever $p$ has mass and $q$ does not. Minimizing $D_{KL}(q, p)$ (the "reverse" KL, the one standard VI actually minimizes) instead results in a $q$ that is sharp and focused on one or a few modes of $p$, because $q$ is heavily penalized wherever it puts mass and $p$ does not. The entropy term in $D_{KL}(q, p) = H(q, p) - H(q)$ does encourage some spread in $q$, but when $p$ is multimodal this mode-seeking behavior dominates.
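The following brute-force sketch illustrates this (the bimodal mixture, the parameter grids, and the discretized KL are all illustrative assumptions on my part, not a standard VI algorithm): fitting a single Gaussian $q$ to a well-separated two-mode $p$, the forward-KL fit comes out wide and straddles both modes, while the reverse-KL fit comes out narrow and locks onto one mode.

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def normal(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# A bimodal "target" p: mixture of two well-separated Gaussians.
p = 0.5 * normal(x, -3.0, 1.0) + 0.5 * normal(x, 3.0, 1.0)

def kl(a, b, eps=1e-300):
    """Discretized D_KL(a || b) = integral of a * log(a / b) dx on the grid."""
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Fit a single Gaussian q by brute-force search over (mu, sigma).
best_fwd = best_rev = (np.inf, None, None)
for mu in np.linspace(-5, 5, 101):
    for sigma in np.linspace(0.5, 6.0, 111):
        q = normal(x, mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)          # forward D(p||q), reverse D(q||p)
        best_fwd = min(best_fwd, (fwd, mu, sigma))
        best_rev = min(best_rev, (rev, mu, sigma))

print("forward KL fit (mu, sigma):", best_fwd[1:])  # ~(0, 3.2): wide, covers both modes
print("reverse KL fit (mu, sigma):", best_rev[1:])  # ~(+-3, 1.0): narrow, picks one mode
```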
"'I will die eventually' is almost certain, therefore it has low entropy" – not sure what you meant to write here, but technically speaking an event has no entropy. You can define its information, and you can measure the entropy of the distribution or the system. The statement "I will die eventually" isn't an event either. – Amelio Vazquez-Reina May 30 '20 at 20:23