I'm trying to compare different approaches to ranking predictions. I have the ground-truth distribution $P$ (discrete, a zeta distribution) and two or more predicted distributions ($Q, Q', Q'', Q'''$ in this case) that I'd like to rank, as in: which one is better?
As I understand it, I can use the KL divergence for this.
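For concreteness, this is roughly how I compute these numbers, using the discrete form $D_{kl}(P || Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$. It's a minimal sketch: the exponent, the truncation $K$, and the candidate $Q$ below are placeholders, not my real data.

```python
import numpy as np
from scipy.stats import zipf
from scipy.special import rel_entr

K = 1000                  # compare over the first K ranks (truncation, placeholder)
k = np.arange(1, K + 1)

p = zipf.pmf(k, a=2.0)    # ground truth P: zeta/Zipf with exponent 2 (placeholder)
p /= p.sum()              # renormalize after truncation

q = zipf.pmf(k, a=2.5)    # one candidate prediction Q (placeholder)
q /= q.sum()

# rel_entr(p, q) computes p * log(p / q) elementwise, so the sum is
# D_KL(P || Q) in nats
d_kl = rel_entr(p, q).sum()
print(f"D_KL(P || Q) = {d_kl:.4f} nats")
```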
What can I say then if one prediction has $D_{kl}(P || Q) = 3$ and another one has $D_{kl}(P || Q') = 6$?
What can I say if one prediction has $D_{kl}(P || Q'') = 0.3$ and another one has $D_{kl}(P || Q''') = 0.6$?
I can certainly say that prediction $Q$ is better than prediction $Q'$, because the difference seems big. But what can I say about $Q''$ versus $Q'''$, where the difference seems small (I know $Q''$ is still the better one)?
My question is: can I somehow say numerically that one prediction is better than another by "this amount"?
EDIT: $Q, Q', Q'', Q'''$ are all different in my question. I'm not confused about the asymmetry of $D_{kl}$, but about how to interpret results on different scales. I edited the numbers to better reflect my question:
- Both examples say that one distribution is "twice as close" to $P$ (the truth) as the other, but the scales are different. How do I interpret that?