Given two distributions, one a parameterized Gaussian and the other a standard normal Gaussian:
$q(x) \sim \mathcal{N}(\mu,\sigma)$
$p(x) \sim \mathcal{N}(0,I)$
We want to compute the KL divergence $D_{KL}(q(x)||p(x))$. It is well known that this has a closed-form solution, and the total KL divergence comes out to:
$D_{KL}(q(x)||p(x)) = -\frac{1}{2}\sum_{i=1}^{D}\left(1 + \log(\sigma_i^2) - \mu_i^2 - \sigma_i^2\right)$
For a random vector with dimension $D$.
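To make sure I have the closed form right, here is a minimal numpy sketch I wrote (the values for $\mu$ and $\sigma$ below are just made-up examples, not from anywhere):

```python
# Minimal sketch: evaluate the closed-form KL between q = N(mu, diag(sigma^2))
# and p = N(0, I). The mu and sigma values are arbitrary examples.
import numpy as np

mu = np.array([0.5, -1.0, 2.0])      # example means, one per dimension
sigma = np.array([0.8, 1.5, 0.3])    # example standard deviations

kl = -0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)
print(kl)
```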
However, I tried to derive this from a different perspective and don't understand what I'm getting wrong... I would really appreciate it if someone could help me out here!
For a random variable $x \sim \mathcal{N}(\mu,\sigma)$, we can reparameterize it by drawing from a noise variable $\epsilon \sim \mathcal{N}(0,1)$ and setting $x = \mu + \sigma\epsilon$.
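Just to double-check the reparameterization itself, here is a quick sketch (my own check, arbitrary values):

```python
# Sanity check of the reparameterization: eps ~ N(0, 1) and x = mu + sigma * eps
# should give samples distributed as N(mu, sigma^2). Values are arbitrary examples.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.8                  # arbitrary example parameters

eps = rng.standard_normal(1_000_000)  # eps ~ N(0, 1)
x = mu + sigma * eps                  # reparameterized samples

print(x.mean(), x.std())              # should be close to 0.5 and 0.8
```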
Next, the KL Divergence is given as:
$D_{KL}(q(x)||p(x)) = \int q(x)\, \log\frac{q(x)}{p(x)}\, dx = \mathbb{E}_{q(x)}[\log q(x) - \log p(x)]$
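This definition can also be estimated by Monte Carlo, which I used as another sanity check (my own sketch, made-up parameters): sample $x \sim q$ and average the log ratio.

```python
# Monte Carlo estimate of E_q[log q(x) - log p(x)], compared to the closed form.
# The parameters are made-up examples.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma = 0.5, 0.8

x = rng.normal(mu, sigma, size=1_000_000)                     # x ~ q
kl_mc = np.mean(norm.logpdf(x, mu, sigma) - norm.logpdf(x, 0.0, 1.0))

kl_closed = -0.5 * (1 + np.log(sigma**2) - mu**2 - sigma**2)  # closed form from above
print(kl_mc, kl_closed)                                       # these agree closely
```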
We can write the log density of a Gaussian with parameters $\mu,\sigma$ as:
$\log \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}{\left(\frac{x-\mu}{\sigma}\right)}^{2}} = -\log{\sigma} - \frac{1}{2}\log(2\pi) - \frac{1}{2} {\left(\frac{x-\mu}{\sigma}\right)}^{2}$
And the log density of a standard normal Gaussian, evaluated at the noise variable $\epsilon$, as:
$\log \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}{\epsilon}^{2}} = - \frac{1}{2}\log(2\pi) -\frac{1}{2}{\epsilon}^{2}$
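(These two expressions check out numerically against scipy, for arbitrary example values; this is just my own verification sketch.)

```python
# Check that the two hand-written log densities match scipy's norm.logpdf.
import numpy as np
from scipy.stats import norm

mu, sigma = 0.5, 0.8
x, eps = 1.3, -0.7

log_q = -np.log(sigma) - 0.5 * np.log(2 * np.pi) - 0.5 * ((x - mu) / sigma) ** 2
log_p = -0.5 * np.log(2 * np.pi) - 0.5 * eps ** 2

print(np.isclose(log_q, norm.logpdf(x, mu, sigma)))   # True
print(np.isclose(log_p, norm.logpdf(eps, 0.0, 1.0)))  # True
```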
So why can't we simply do:
$$ \begin{eqnarray} \mathbb{E}_{q(x)}[\log q(x) - \log p(x)] &=& \mathbb{E}_{q(x)}[-\log{\sigma} - \frac{1}{2}\log(2\pi) - \frac{1}{2} {\left(\frac{x-\mu}{\sigma}\right)}^{2} - (- \frac{1}{2}\log(2\pi) -\frac{1}{2}{\epsilon}^{2})] \\ &=& \mathbb{E}_{q(x)}[-\log(\sigma) - \frac{1}{2} {\left(\frac{x-\mu}{\sigma}\right)}^{2} +\frac{1}{2}{\epsilon}^{2}] \\ &=& \mathbb{E}_{p(\epsilon)}[-\log(\sigma) - \frac{1}{2} {\left(\frac{\mu + \sigma\epsilon -\mu}{\sigma}\right)}^{2} +\frac{1}{2}{\epsilon}^{2}] \\ &=& \mathbb{E}_{p(\epsilon)}[-\log(\sigma) - \frac{1}{2}{\epsilon}^{2} +\frac{1}{2}{\epsilon}^{2}] \\ &=& \mathbb{E}_{p(\epsilon)}[-\log(\sigma)] \\ &=& -\log(\sigma) \end{eqnarray} $$
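To convince myself that this can't be right, I also compared it numerically against the closed form (again with made-up values for $\mu$ and $\sigma$):

```python
# My derivation's result vs. the closed form, for arbitrary example values.
import numpy as np

mu, sigma = 0.5, 0.8
kl_closed = -0.5 * (1 + np.log(sigma**2) - mu**2 - sigma**2)  # ~0.168
my_result = -np.log(sigma)                                    # ~0.223

print(kl_closed, my_result)  # these do not agree
```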
Something is really missing here :( I thought we are allowed to plug in the reparameterization $x = \mu + \sigma\epsilon$ and thereby change the distribution the expectation is taken over from $q(x)$ to $p(\epsilon)$.