
We all know that $D(p||q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$, and that it is used to quantify the difference between the true distribution $p$ and the observed distribution $q$. However, I do not get the intuition for why $p(x)$ is used as the weight in the formula for $D(p||q)$. From a probabilistic point of view, $D(p||q)$ can be written as $D(p||q) = E_{x \sim p}\log\frac{p(x)}{q(x)}$, so is $q(x)$ viewed as a constant here? It would be nice if someone could help explain the intuition behind using $p(x)$ as the weight.

Hunglk
  • Does this help? https://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758 – kjetil b halvorsen Nov 18 '21 at 13:24
  • Does the section "Cross Entropy and KL Divergence" in the following blog help? https://leimao.github.io/blog/Cross-Entropy-KL-Divergence-MLE/ – Ganesh Tata Nov 19 '21 at 17:06

1 Answer


For me, the best intuition (and even a derivation of why the forward KL is useful) comes from information theory, specifically optimal codes. I'll introduce the basics below; for much more detail, please see Chapter 5 of "Elements of Information Theory" (Cover & Thomas).

Suppose you have a message $x$, which we assume is a realisation of a random variable $X$; denote its distribution by $p(x)$. The optimal expected code length $L$ (in bits) is then given by Shannon's bound

$$ H(p) \leq L < H(p) + 1 $$

where $H(p)$ is the Shannon entropy. In reality we do not know the true distribution $p(x)$ and approximate it with a model $q(x)$ (e.g. taking $q$ to be the empirical distribution over symbols). The expected code length $L$ if we use the model $q$ instead of the true distribution $p$ is then

$$ H(p) + D(p||q) \leq L < H(p) + D(p||q) + 1 $$

(for a proof, see Theorem 5.4.3 in Elements of Information Theory).
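To make the overhead concrete, here is a minimal sketch in Python. The distributions $p$ and $q$ are made up, and a Shannon code with codeword lengths $\lceil \log_2 1/q(x) \rceil$ is just one concrete code achieving the bound; the point is that coding with $q$ costs roughly $H(p) + D(p||q)$ bits per symbol instead of roughly $H(p)$.

```python
import math

# A small made-up 4-symbol alphabet: p is the true distribution, q is our model.
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
q = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

H_p = -sum(p[x] * math.log2(p[x]) for x in p)         # Shannon entropy H(p)
D_pq = sum(p[x] * math.log2(p[x] / q[x]) for x in p)  # KL divergence D(p||q)

# Expected length of a Shannon code built from q (lengths ceil(log2 1/q(x)))
# when the symbols actually arrive with the true frequencies p(x).
L_q = sum(p[x] * math.ceil(-math.log2(q[x])) for x in p)

print(f"H(p)             = {H_p:.3f} bits")
print(f"D(p||q)          = {D_pq:.3f} bits")
print(f"E_p[code length] = {L_q:.3f} bits")  # lies in [H(p)+D(p||q), H(p)+D(p||q)+1)
```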

Hence, $D(p||q)$ is the penalty (or overhead) we pay, in terms of increased expected code length, for coding with the incorrect distribution $q$.

The use of $p(x)$ as a weight is, I think, intuitive when we think about a message $x$: regardless of our model $q$, the messages that need to be coded will arrive distributed according to the unknown $p$, hence this is the distribution that determines the expected code length.
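As a rough illustration of that point (again with made-up $p$ and $q$): if you simply average $\log\frac{p(x)}{q(x)}$ over messages as they actually arrive, i.e. over samples drawn from $p$, you recover $D(p||q)$. The weighting by $p(x)$ is not a modelling choice but a consequence of where the data comes from.

```python
import math
import random

random.seed(0)

p = [0.5, 0.25, 0.125, 0.125]  # true distribution (made up)
q = [0.25, 0.25, 0.25, 0.25]   # model distribution (made up)

# Messages arrive from p no matter what model q we use, so a plain average of
# log2 p(x)/q(x) over arriving messages estimates E_{x~p}[log2 p(x)/q(x)] = D(p||q).
n = 100_000
sample = random.choices(range(len(p)), weights=p, k=n)
kl_mc = sum(math.log2(p[i] / q[i]) for i in sample) / n

kl_exact = sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q))
print(f"Monte Carlo D(p||q) ~ {kl_mc:.3f} bits   (exact: {kl_exact:.3f} bits)")
```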