Suppose I am adjusting distribution $Q$ to get the best fit to distribution $P$. Should I minimize $KL(P||Q)$ or $KL(Q||P)$? What is the difference?
Related question:
You usually want $KL(P||Q)$. That's the divergence from $Q$ to $P$; I remember that the notation reads from right to left, just like the notation for conditional probabilities.
You want the expectation to be taken with respect to the true distribution $P$, i.e. $KL(P||Q) = \mathbb{E}_P[\log P - \log Q]$. That way, sample averages over data drawn from $P$ can be assumed to converge to the true expectations, by the law of large numbers. Since $\mathbb{E}_P[\log P]$ does not depend on $Q$, minimizing $KL(P||Q)$ over $Q$ is the same as maximizing the expected log-likelihood $\mathbb{E}_P[\log Q]$, i.e. maximum likelihood estimation.
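As a minimal sketch of this point (the two-component mixture for $P$, the single-Gaussian family for $Q$, and all numbers are made up for illustration; it assumes NumPy and SciPy): drawing samples from $P$ and minimizing the sample average of $-\log Q$ is a Monte Carlo version of minimizing $KL(P||Q)$, because the $\mathbb{E}_P[\log P]$ term is a constant in $Q$.

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical setup: the "true" P is a two-component Gaussian mixture,
# and we fit a single Gaussian Q(mu, sigma) to samples from it.
rng = np.random.default_rng(0)
n = 10_000
comp = rng.random(n) < 0.5
x = np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(3.0, 1.0, n))  # samples from P

def neg_avg_loglik(params):
    # Sample average of -log Q; by the law of large numbers this converges
    # to E_P[-log Q] = KL(P||Q) + const, so minimizing it over the
    # parameters of Q is (Monte Carlo) forward-KL minimization, i.e. MLE.
    mu, log_sigma = params
    return -np.mean(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = optimize.minimize(neg_avg_loglik, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # Q stretches to cover both modes of P
```

Note the characteristic behaviour: the forward direction $KL(P||Q)$ heavily penalizes putting $Q \approx 0$ where $P > 0$, so the fitted Gaussian spreads to cover both modes (mass-covering), whereas minimizing the reverse direction $KL(Q||P)$ tends to lock onto a single mode (mode-seeking).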
The KL divergence is not a distance, which is why the alternative word "divergence" is used instead: it is not symmetric ($KL(P||Q) \ne KL(Q||P)$ in general) and it does not satisfy the triangle inequality. If you want symmetry you can take the sum $KL(P||Q) + KL(Q||P)$ (the Jeffreys divergence), as mentioned in one of the answers in the linked post.
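To see the asymmetry numerically, here is a small sketch (the two discrete distributions are arbitrary toy values; `scipy.special.rel_entr` computes the elementwise terms $p_i \log(p_i/q_i)$ in nats):

```python
import numpy as np
from scipy.special import rel_entr

p = np.array([0.7, 0.2, 0.1])  # toy "true" distribution P
q = np.array([0.4, 0.4, 0.2])  # toy approximation Q

kl_pq = rel_entr(p, q).sum()   # KL(P||Q) = sum_i p_i * log(p_i / q_i)
kl_qp = rel_entr(q, p).sum()   # KL(Q||P) = sum_i q_i * log(q_i / p_i)

print(kl_pq, kl_qp)            # ~0.184 vs ~0.192: not symmetric
print(kl_pq + kl_qp)           # symmetrized sum (Jeffreys divergence)
```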
The intuition is that you do not know the true distribution $P$, so you make an estimate or guess $Q$ of it. The two may be in the same parametric family, or they may not be similar at all. To get some notion of how far your assigned probabilities (your view of the events) are from the true probabilities (how much the two perspectives diverge), you take the expectation of the log of your assigned probabilities under your estimate $Q$. However, you would still like the divergence to be $0$ if you somehow specified the exact model; to ensure this, you subtract the corresponding expected log of the true probabilities. Using the logarithm property $\log(x/y) = \log(x) - \log(y)$ then gives the KL-divergence equation.
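Spelling that construction out (with $P$ the true distribution and $Q$ your estimate, as above), the expected log of your assigned probabilities minus the expected log of the true probabilities, both under $Q$, is

$$KL(Q||P) = \mathbb{E}_Q[\log Q] - \mathbb{E}_Q[\log P] = \sum_x Q(x)\,\big(\log Q(x) - \log P(x)\big) = \sum_x Q(x)\log\frac{Q(x)}{P(x)},$$

which is $0$ exactly when $Q = P$ and strictly positive otherwise (Gibbs' inequality, via Jensen's inequality).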