
I'm looking at the following implementation of a VAE: https://github.com/jmtomczak/vae_vpflows/blob/master/models/VAE.py

KL divergence is implemented as:

# KL
log_p_z = log_Normal_standard(z_q, dim=1)
log_q_z = log_Normal_diag(z_q, z_q_mean, z_q_logvar, dim=1)
KL = -(log_p_z - log_q_z)

z_q is a batch of samples drawn from the approximate posterior q(z|x), and z_q_mean and z_q_logvar are the predicted means and log variances from which the samples are drawn. log_Normal_standard and log_Normal_diag are implemented as follows:

def log_Normal_diag(x, mean, log_var, average=False, dim=None):
    # log density of N(x; mean, diag(exp(log_var))), with the constant
    # -0.5*log(2*pi) per dimension omitted
    log_normal = -0.5 * ( log_var + torch.pow( x - mean, 2 ) / torch.exp( log_var ) )
    if average:
        return torch.mean( log_normal, dim )
    else:
        return torch.sum( log_normal, dim )

def log_Normal_standard(x, average=False, dim=None):
    # log density of N(x; 0, I), with the same constant omitted
    log_normal = -0.5 * torch.pow( x , 2 )
    if average:
        return torch.mean( log_normal, dim )
    else:
        return torch.sum( log_normal, dim )
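For reference, these helpers compute Gaussian log densities with the normalising constant dropped. A quick sketch (assuming PyTorch with `torch.distributions`; the inputs are made-up values for illustration) checking that `log_Normal_diag` matches `torch.distributions.Normal.log_prob` up to $-\tfrac{1}{2}\ln 2\pi$ per dimension:

```python
import math
import torch

def log_Normal_diag(x, mean, log_var, average=False, dim=None):
    # same helper as above: Gaussian log density, constant omitted
    log_normal = -0.5 * (log_var + torch.pow(x - mean, 2) / torch.exp(log_var))
    if average:
        return torch.mean(log_normal, dim)
    return torch.sum(log_normal, dim)

# made-up inputs, one batch element with two latent dimensions
x = torch.tensor([[0.2, -1.1]])
mean = torch.tensor([[0.0, 0.5]])
log_var = torch.tensor([[0.3, -0.2]])

# full log density from torch.distributions (includes the constant)
dist = torch.distributions.Normal(mean, torch.exp(0.5 * log_var))
full = dist.log_prob(x).sum(dim=1)

# the helper omits -0.5 * log(2*pi) for each of the d dimensions
d = x.shape[1]
approx = log_Normal_diag(x, mean, log_var, dim=1)
print(torch.allclose(approx - 0.5 * d * math.log(2 * math.pi), full))  # True
```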

I'm unfamiliar with this calculation of KL divergence for lognormal distributions and I can't find any supplementary material that matches this formulation.

Can anyone point me to equations that match this formulation?

  • Check out https://stats.stackexchange.com/questions/7440/kl-divergence-between-two-univariate-gaussians – Jan Kukacka Jun 26 '19 at 07:47
  • to clarify: it's not a lognormal distribution, it's the log density of a normal distribution. – shimao Jun 26 '19 at 08:02
  • @JanKukacka I've looked at the derivation you provided, but it is quite different from what I see here. On further reading, the KL divergence expressed as (log_p_z - log_q_z) seems to be specified in terms of the log density ratio, which I read about here https://tiao.io/post/density-ratio-estimation-for-kl-divergence-minimization-between-implicit-distributions/. So it isn't directly comparing distributions but their densities at z. The log_p_z density is given by the standard normal density function – Michael Anslow Jun 26 '19 at 15:19
  • https://wikimedia.org/api/rest_v1/media/math/render/svg/3123d8dd4c3386afe9fac119fed2cfaf7ce9f336. I suppose we ignore the normalising term because it is not important in optimisation? The same is done for the generalised normal distribution https://wikimedia.org/api/rest_v1/media/math/render/svg/dabaca1788ef8fdca1741f3481e862131ac54059. – Michael Anslow Jun 26 '19 at 15:27

1 Answer


Recall that the KL divergence between two distributions is given by $\mathcal{D}_{KL} ( P \mid \mid Q) = \underset{x \in X}{\sum} P(x) \ln \left[ \frac{P(x)}{Q(x)} \right]$. This is the expected value of the log density ratio of $P$ and $Q$, taken with respect to $x$ distributed as $P(x)$. If you need further convincing, expanding the previous expression gives:

$$\mathcal{D}_{KL} ( P \mid \mid Q) = \left[ \underset{x \in X}{\sum} P(x) \cdot \left( \ln P(x) - \ln Q(x) \right) \right] $$

$$ = \underset{x \sim P}{\mathbb{E}} \left[ \ln P(x) - \ln Q(x) \right]$$

By substituting the variational posterior $q_{\phi} (z \mid x)$ for $P(x)$ and the prior $p(z)$ for $Q(x)$, we obtain an expression for the divergence term, formulated as an expectation below:

$$ \underset{z \sim q(z \mid x)}{\mathbb{E}} \left[ \ln q (z \mid x) - \ln p(z) \right]$$

$$ = \underset{z \sim q(z \mid x)}{\mathbb{E}} \ln \left[ \frac{q(z \mid x)}{p(z)} \right]$$
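The code in the question is exactly a single-sample Monte Carlo estimate of this expectation: `z_q` is one draw from $q(z \mid x)$, so `log_q_z - log_p_z` (i.e. `-(log_p_z - log_q_z)` negated) is an unbiased estimate of the KL. A sketch (PyTorch assumed; the means and log variances are made-up values) comparing the many-sample average of this estimator against the closed-form KL between a diagonal Gaussian and the standard normal:

```python
import torch

torch.manual_seed(0)

# made-up variational parameters for a 2-dimensional latent
mean = torch.tensor([0.5, -1.0])
log_var = torch.tensor([0.2, -0.3])

# many samples z ~ q(z|x) via the reparameterisation trick
n = 200_000
eps = torch.randn(n, 2)
z = mean + torch.exp(0.5 * log_var) * eps

# log densities with constants omitted, as in the question's helpers
log_q = (-0.5 * (log_var + (z - mean) ** 2 / torch.exp(log_var))).sum(dim=1)
log_p = (-0.5 * z ** 2).sum(dim=1)

# Monte Carlo estimate of E_q[log q(z|x) - log p(z)]
kl_mc = (log_q - log_p).mean()

# closed form: KL(N(mu, sigma^2) || N(0, I)) summed over dimensions
kl_exact = 0.5 * (torch.exp(log_var) + mean ** 2 - 1.0 - log_var).sum()

print(kl_mc.item(), kl_exact.item())  # the two values agree closely
```

Note that the dropped $-\tfrac{1}{2}\ln 2\pi$ constants appear in both `log_q` and `log_p`, so their difference is unaffected.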

If you recall, the second version of the SGVB estimator actually calls for the negative of the KL divergence: $$\tilde{\mathcal{L}}^{B} (\theta, \phi, x^{(i)}) = -D_{KL} (q_{\phi} (z \mid x^{(i)}) \mid \mid p_{\theta}(z)) + \frac{1}{L} \sum\limits_{l=1}^{L} \log p_{\theta} (x^{(i)} \mid z^{(i, l)})$$

For each of the latents in the hierarchy, $-D_{KL}$ is computed using the log density of a Gaussian. The normalising constants are omitted from the expression because they cancel when the two log densities are subtracted.
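Concretely, both log densities are missing the same $-\frac{d}{2} \ln 2\pi$ term, so it cancels in the subtraction. A small sketch (PyTorch assumed, made-up values) showing that adding the full normalising constants back to both densities leaves the KL term unchanged:

```python
import math
import torch

torch.manual_seed(0)

# made-up variational parameters and one reparameterised sample
mean = torch.tensor([0.3, -0.7])
log_var = torch.tensor([0.1, 0.4])
z = mean + torch.exp(0.5 * log_var) * torch.randn(2)

d = z.numel()
const = -0.5 * d * math.log(2 * math.pi)  # identical constant for both densities

# log densities without constants, as in the repository code
log_q = (-0.5 * (log_var + (z - mean) ** 2 / torch.exp(log_var))).sum()
log_p = (-0.5 * z ** 2).sum()

# log densities with the full normalising constants restored
log_q_full = log_q + const
log_p_full = log_p + const

kl_term = -(log_p - log_q)
kl_term_full = -(log_p_full - log_q_full)

print(torch.allclose(kl_term, kl_term_full))  # True: the constants cancel
```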