When minimizing the KL divergence in machine learning, why is the KL expressed from data to model, instead of from model to data?
- not sure if you're referring to variational inference, but if you are, minimizing the KL divergence in the other direction is called expectation propagation – aleshing Sep 12 '18 at 20:48
- See also https://stats.stackexchange.com/questions/482362/kl-divergence-pq-vs-qp/482374#482374 – kjetil b halvorsen Sep 21 '21 at 22:27
1 Answer
One explanation is that this is what maximizing (log) likelihood gives you.
Suppose you have observed coin tosses $y_i \in \{0,1\}$, $i=1,\dots,n$, and wish to estimate the coin-flip probability $p$. Maximizing the likelihood gives
\begin{align*}
\arg \max_p \prod_{i=1}^n p^{y_i}(1-p)^{1-y_i} &= \arg\max_p \sum_{i=1}^n y_i \log p + (1-y_i)\log(1-p) \\
&= \arg\max_p -\sum_{i=1}^n y_i \log \frac{1}{p} + (1-y_i)\log \frac{1}{1-p} \\
&= \arg\min_p \sum_{i=1}^n y_i \log \frac{y_i}{p} + (1-y_i)\log \frac{1-y_i}{1-p} \\
&= \arg\min_p \sum_{i=1}^n \text{KL}(y_i \,\|\, p),
\end{align*}
where the third equality uses $0 \log 0 = 0$ (so $y_i \log y_i = (1-y_i)\log(1-y_i) = 0$ for $y_i \in \{0,1\}$), and $\text{KL}(y_i \,\|\, p)$ denotes the KL divergence from $\text{Bernoulli}(y_i)$ to $\text{Bernoulli}(p)$. So maximum likelihood estimation is exactly minimizing the KL divergence from the data (each observation viewed as a degenerate distribution) to the model, which is why that direction of the KL is the one that shows up.
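As a numerical sanity check (a minimal sketch, not part of the derivation above; the data are made up with an assumed true $p = 0.3$), the per-observation KL sum and the negative Bernoulli log-likelihood are the same objective, and minimizing either recovers the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Made-up coin tosses for illustration; the true p = 0.3 is an assumption.
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.3, size=100)

def neg_log_likelihood(p):
    # Negative Bernoulli log-likelihood: -sum_i [y_i log p + (1-y_i) log(1-p)]
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def sum_kl(p):
    # sum_i KL(Ber(y_i) || Ber(p)); with y_i in {0,1} and 0 log 0 = 0 this is
    # sum_i [y_i log(1/p) + (1-y_i) log(1/(1-p))] -- algebraically the same
    # objective as neg_log_likelihood above.
    return np.sum(y * np.log(1 / p) + (1 - y) * np.log(1 / (1 - p)))

p_mle = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 1 - 1e-6), method="bounded").x
p_kl = minimize_scalar(sum_kl, bounds=(1e-6, 1 - 1e-6), method="bounded").x
print(p_mle, p_kl, y.mean())  # all three coincide at the sample mean
```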
elexhobby