Consider the Bayesian posterior $\theta\mid X$. Asymptotically, its maximum occurs at the maximum likelihood estimate $\hat \theta$, which simply maximizes the likelihood: $\hat \theta = \operatorname{argmax}_\theta\, f_\theta(X)$.
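To see why the prior eventually washes out, here is the standard heuristic sketch (writing the data as $X = (x_1, \dots, x_n)$ with i.i.d. entries and $\pi$ for the prior, notation I'm introducing just for this aside):

$$ p(\theta \mid X) \propto \pi(\theta) \prod_{i=1}^{n} f_\theta(x_i), $$

and as $n$ grows the likelihood product varies with $\theta$ far more sharply than the fixed prior factor, so the posterior mode is pulled toward the maximizer of the likelihood.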
All these concepts—Bayesian posteriors, maximizing the likelihood—sound super principled and not at all arbitrary. There’s not a log in sight.
Yet MLE asymptotically minimizes the KL divergence between the true distribution $\tilde f$ and the model $f_\theta$, i.e., it minimizes
$$ KL(\tilde f \parallel f_\theta) = \int_{-\infty}^{+\infty} \tilde f(x) \left[ \log \tilde f(x) - \log f_\theta(x) \right] \, dx $$
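A quick sketch of why this equivalence holds, under the usual assumption of i.i.d. samples $x_1, \dots, x_n$ drawn from $\tilde f$: by the law of large numbers, the average log-likelihood that MLE maximizes converges to

$$ \frac{1}{n} \sum_{i=1}^{n} \log f_\theta(x_i) \;\longrightarrow\; \int_{-\infty}^{+\infty} \tilde f(x) \log f_\theta(x) \, dx = -KL(\tilde f \parallel f_\theta) + \int_{-\infty}^{+\infty} \tilde f(x) \log \tilde f(x) \, dx, $$

where the last term (the negative entropy of $\tilde f$) does not depend on $\theta$. So, in the limit, maximizing the likelihood and minimizing $KL(\tilde f \parallel f_\theta)$ pick out the same $\theta$.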
Woah—where did these logs come from? Why KL divergence in particular?
Why, for example, does minimizing a different divergence not correspond to the super principled and motivated concepts of Bayesian posteriors and maximizing likelihood above?
There seems to be something special about KL divergence and/or logs in this context. Of course, we can throw our hands in the air and say that’s just how the math is. But I suspect there might be some deeper intuition or connections to uncover.