When minimizing the KL divergence in machine learning, why is the KL expressed from data to model, instead of from model to data?
- not sure if you're referring to variational inference, but if you are, minimizing the KL divergence in the other direction is called expectation propagation – aleshing Sep 12 '18 at 20:48
- See also https://stats.stackexchange.com/questions/482362/kl-divergence-pq-vs-qp/482374#482374 – kjetil b halvorsen Sep 21 '21 at 22:27
1 Answer
One explanation is that this is what maximizing (log) likelihood gives you.
Suppose you have observed coin tosses $y_i \in \{0,1\}$, $i=1,\dots,n$, and wish to estimate the coin-flip probability $p$. Maximizing the likelihood gives
\begin{align*}
\arg \max_p \prod_{i=1}^n p^{y_i}(1-p)^{1-y_i} &= \arg\max_p \sum_{i=1}^n y_i \log p + (1-y_i)\log(1-p) \\
&= \arg\max_p -\sum_{i=1}^n y_i \log \frac{1}{p} + (1-y_i)\log \frac{1}{1-p} \\
&= \arg\min_p \sum_{i=1}^n y_i \log \frac{y_i}{p} + (1-y_i)\log \frac{1-y_i}{1-p} \\
&= \arg\min_p \sum_{i=1}^n \text{KL}(y_i \,\|\, p),
\end{align*}
where the third equality uses $0 \log 0 = 0$ (so $y_i \log y_i = (1-y_i)\log(1-y_i) = 0$ for $y_i \in \{0,1\}$), and $\text{KL}(y_i \,\|\, p)$ denotes the KL divergence from $\text{Bernoulli}(y_i)$ to $\text{Bernoulli}(p)$. So maximum likelihood estimation is exactly minimizing the KL divergence from the data (each observation viewed as a degenerate distribution) to the model, which is why that direction of the KL is the one that shows up.
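As a numerical sanity check (a minimal sketch, not part of the derivation above; the data are made up with an assumed true $p = 0.3$), the per-observation KL sum and the negative Bernoulli log-likelihood are the same objective, and minimizing either recovers the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up coin tosses for illustration; the true p = 0.3 is an assumption.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=100)

def neg_log_likelihood(p):
    # Negative Bernoulli log-likelihood: -sum_i [y_i log p + (1-y_i) log(1-p)]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def sum_kl(p):
    # sum_i KL(Ber(y_i) || Ber(p)); with y_i in {0,1} and 0 log 0 = 0 this is
    # sum_i [y_i log(1/p) + (1-y_i) log(1/(1-p))] -- algebraically the same
    # objective as neg_log_likelihood above.
    return np.sum(y * np.log(1 / p) + (1 - y) * np.log(1 / (1 - p)))

p_mle = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded").x
p_kl = minimize_scalar(sum_kl, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(p_mle, p_kl, y.mean())  # all three coincide at the sample mean
```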
elexhobby