
I've encountered a sentence:

In information theory, Kullback–Leibler divergence is regarded as a measure of the information lost when probability distribution Q is used to approximate a true probability distribution P.

When I think about it, I don't understand why this is true. From information theory I know that the Kullback–Leibler divergence tells us by how many bits the average code length grows when we optimize the code for Q but use it for messages generated from the true distribution P.
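
To make my current understanding concrete, this is the identity I have in mind (my own notation, so please correct me if I've got it wrong):

```latex
\[
  D_{\mathrm{KL}}(P \,\|\, Q)
  = \underbrace{\Bigl(-\sum_x P(x)\log_2 Q(x)\Bigr)}_{H(P,Q):\ \text{avg. code length, code built for } Q}
  - \underbrace{\Bigl(-\sum_x P(x)\log_2 P(x)\Bigr)}_{H(P):\ \text{avg. code length, code built for } P}
  = \sum_x P(x)\log_2\frac{P(x)}{Q(x)} .
\]
```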

I've tried to find some resources to understand it on my own, for example here: www.countbayesie.com. The problem is that this statement is simply assumed to be true there.

The biggest problem for me is how to relate the "information loss" in the situation explained there (approximating the observed distribution with a uniform vs. a binomial distribution) to the average code length.
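
To show what I mean numerically, here is a small Python sketch of the two quantities. The "observed" distribution and the two candidate approximations are made up by me; they are not the numbers from that article:

```python
import numpy as np
from scipy.stats import binom

# Made-up "observed" distribution over 6 outcomes (sums to 1).
p = np.array([0.02, 0.10, 0.25, 0.30, 0.22, 0.11])

# Two candidate approximations: uniform, and a binomial with n=5 and a guessed p.
q_uniform = np.full(6, 1 / 6)
q_binom = binom.pmf(np.arange(6), n=5, p=0.55)

def cross_entropy_bits(p, q):
    """Average code length (bits) when the code is built for q but data come from p."""
    return -np.sum(p * np.log2(q))

def entropy_bits(p):
    """Optimal average code length (bits) when the code is built for p itself."""
    return -np.sum(p * np.log2(p))

def kl_bits(p, q):
    """Extra bits per message, i.e. the claimed 'information loss'."""
    return cross_entropy_bits(p, q) - entropy_bits(p)

for name, q in [("uniform", q_uniform), ("binomial", q_binom)]:
    print(f"{name:8s}  cross-entropy = {cross_entropy_bits(p, q):.4f} bits, "
          f"KL = {kl_bits(p, q):.4f} bits")
```

The KL value here is exactly the gap between the two average code lengths, and it is this gap that I would like to see justified as "information loss".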

Ideally I'd like to make the connection using theoretically valid assumptions, reasoning, and formulas rather than intuitions (though those could be helpful as well).

If there is a book that explains it, I'm happy to look at it.

Martyna
