My supervisor has written some (relatively draft-like) R code that implements an idea from Optimal Experiment Design. However, instead of using the KL divergence
$D_{KL}(P(M | Y = y) || P(M))$
as proposed in the paper (similar to the step on page 3, return KL( mPosterior , mPrior )),
he directly used the absolute value of the difference between the entropies of the two distributions:
$\lvert H(P(M | Y = y)) - H(P(M)) \rvert$
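To make the difference between the two criteria concrete, here is a minimal sketch in R (my own illustration, not the supervisor's code): two distributions that are permutations of each other have identical entropies, so the absolute entropy difference is exactly zero, yet their KL divergence is strictly positive.

```r
# Sketch: KL divergence vs. absolute entropy difference on discrete
# distributions over the model space M. All names here are illustrative.
entropy <- function(p) -sum(p[p > 0] * log(p[p > 0]))
kl      <- function(p, q) sum(p[p > 0] * log(p[p > 0] / q[p > 0]))

prior     <- c(0.7, 0.2, 0.1)
posterior <- c(0.1, 0.2, 0.7)  # a permutation of the prior: same entropy

kl(posterior, prior)                      # > 0: the beliefs did change
abs(entropy(posterior) - entropy(prior))  # exactly 0: entropies coincide
```

So the two quantities are not equivalent in general: KL measures how much the belief moved, while the entropy difference only measures how much less (or more) uncertain the belief became.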
I haven't had a chance to ask him about his intention yet (he's busy), so I'd like to post the question here first.
I've already run some experiments based on this code, and it seems to work and does indeed optimize the process (i.e. I now need fewer trials for my models/beliefs to converge, compared to just generating trials from a flat prior).
However, since I haven't seen this formulation before (just the absolute value of the difference between the entropies of two distributions), I'd like to know its theoretical basis: why it works, and why one might choose it instead of KL divergence. Or did I simply misunderstand, and this is actually just a slightly varied form of KL divergence/cross entropy? My stats background is relatively weak.
return KL( mPosterior , mPrior ), i.e. $D_{KL}(P(M | Y = y) || P(M))$, since we're trying to compare the beliefs we have obtained from all the experiment trial results so far with the beliefs updated by just one more experiment trial result, and therefore trying to produce an optimal probability distribution from which the next experiment trial will be sampled. – xji Nov 17 '17 at 14:23
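The one-step-ahead selection described in that comment can be sketched as follows (again my own hedged illustration with made-up toy numbers, not the paper's or the supervisor's code): score each candidate trial design by the expected KL divergence between the updated posterior and the current belief, averaging over the possible outcomes of that trial, and pick the design with the largest expected gain.

```r
# Toy expected-information-gain step for choosing the next trial.
# Two models, two candidate designs; outcomes are binary (success/failure).
kl <- function(p, q) sum(p[p > 0] * log(p[p > 0] / q[p > 0]))

# Probability of "success" under each model (rows) for each design (cols).
# Design 1 discriminates the models; design 2 is uninformative.
lik <- matrix(c(0.9, 0.5,
                0.1, 0.5), nrow = 2, byrow = TRUE)
prior <- c(0.5, 0.5)

eig <- sapply(1:ncol(lik), function(d) {
  sum(sapply(c(1, 0), function(y) {
    like_y <- if (y == 1) lik[, d] else 1 - lik[, d]
    py   <- sum(prior * like_y)   # marginal probability of outcome y
    post <- prior * like_y / py   # Bayes update of the model beliefs
    py * kl(post, prior)          # KL gain weighted by outcome probability
  }))
})
which.max(eig)  # picks design 1: the uninformative design scores 0
```

Under this criterion the uninformative design gets an expected gain of exactly zero, because its posterior always equals the prior; the same need not hold if the score were only an entropy difference.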