
I've encountered a sentence:

In information theory, Kullback–Leibler divergence is regarded as a measure of the information lost when probability distribution Q is used to approximate a true probability distribution P.

When I think about it, I don't understand why this is true. From information theory I know that the Kullback–Leibler divergence tells us by how many bits the average code length grows when we optimize the code for Q but use it for messages generated from the true distribution P.
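
To make my current understanding concrete, this is the identity I have in mind (my own notation, so please correct me if I've got it wrong):

```latex
\[
  D_{\mathrm{KL}}(P \,\|\, Q)
  = \underbrace{\Bigl(-\sum_x P(x)\log_2 Q(x)\Bigr)}_{H(P,Q):\ \text{avg. code length, code built for } Q}
  - \underbrace{\Bigl(-\sum_x P(x)\log_2 P(x)\Bigr)}_{H(P):\ \text{avg. code length, code built for } P}
  = \sum_x P(x)\log_2\frac{P(x)}{Q(x)} .
\]
```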

I've tried to find some resources to understand it on my own, for example here: www.countbayesie.com. The problem is that this statement is simply assumed to be true there.

The biggest problem for me is how to relate the "information loss" in the situation explained there (approximating the observed distribution with a uniform vs. a binomial distribution) to the average code length.
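
To show what I mean numerically, here is a small Python sketch of the two quantities. The "observed" distribution and the two candidate approximations are made up by me; they are not the numbers from that article:

```python
import numpy as np
from scipy.stats import binom

# Made-up "observed" distribution over 6 outcomes (sums to 1).
p = np.array([0.02, 0.10, 0.25, 0.30, 0.22, 0.11])

# Two candidate approximations: uniform, and a binomial with n=5 and a guessed p.
q_uniform = np.full(6, 1 / 6)
q_binom = binom.pmf(np.arange(6), n=5, p=0.55)

def cross_entropy_bits(p, q):
    """Average code length (bits) when the code is built for q but data come from p."""
    return -np.sum(p * np.log2(q))

def entropy_bits(p):
    """Optimal average code length (bits) when the code is built for p itself."""
    return -np.sum(p * np.log2(p))

def kl_bits(p, q):
    """Extra bits per message, i.e. the claimed 'information loss'."""
    return cross_entropy_bits(p, q) - entropy_bits(p)

for name, q in [("uniform", q_uniform), ("binomial", q_binom)]:
    print(f"{name:8s}  cross-entropy = {cross_entropy_bits(p, q):.4f} bits, "
          f"KL = {kl_bits(p, q):.4f} bits")
```

The KL value here is exactly the gap between the two average code lengths, and it is this gap that I would like to see justified as "information loss".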

Ideally I'd like to make the connection using theoretically valid assumptions, reasoning, and formulas rather than intuitions (though those could be helpful as well).

If there is a book that explains it, I'm happy to look at it.

Martyna
