
Just to clarify: I think I understand all the derivations in the context of VAEs pretty well; however, there is one last thing I need explained. There are multiple related derivations of the evidence lower bound (ELBO). The following one uses Jensen's inequality to form the bound.

$$ \begin{align} & \operatorname*{argmax}_\Theta \log \mathbb{E}_{z \sim q(z|x)}[p(x|z) * \frac{p(z)}{q(z|x)}] \\ &\geq \operatorname*{argmax}_\Theta \mathbb{E}_{z \sim q(z|x)}[\log(p(x|z) * \frac{p(z)}{q(z|x)})] && \text{Jensen's inequality} \\ &= \operatorname*{argmax}_\Theta \mathbb{E}_{z \sim q(z|x)}[\log p(x|z) + \log p(z) - \log q(z|x)] \\ &= \operatorname*{argmax}_\Theta \mathbb{E}_{z \sim q(z|x)}[\log p(x|z)] + \mathbb{E}_{z \sim q(z|x)}[\log p(z) - \log q(z|x)] \\ &= \operatorname*{argmax}_\Theta \mathbb{E}_{z \sim q(z|x)}\left[\log p(x|z)\right] - D_{KL}[q(z|x)\parallel p(z)] \end{align} $$
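(For completeness, the quantity inside the outer log is exactly the marginal likelihood, assuming $q(z|x) > 0$ wherever $p(x,z) > 0$:

$$ \mathbb{E}_{z \sim q(z|x)}\left[p(x|z) \frac{p(z)}{q(z|x)}\right] = \int q(z|x) \, \frac{p(x|z)\,p(z)}{q(z|x)} \, dz = \int p(x,z) \, dz = p(x), $$

so the first line above is just $\operatorname{argmax}_\Theta \log p(x)$ written as an importance-sampling expectation.)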

However, is there a reason why we need to apply Jensen's inequality here? Can't we just approximate the term by sampling, as it is? Is it purely because of numerical issues, or is there some mathematical rule I overlooked that prohibits approximating the inner expectation by sampling from it directly?

I have found a related answer that implies that we actually do not need this lower bound in general: *Why is computing $\log p(x)$ difficult, but not the ELBO?*

Here is a first idea of why Jensen's inequality is necessary. Let's assume we want to optimize, for example, with batched stochastic gradient descent.

$$ \begin{align} & \mathbb{E}_{x \sim D} \log \mathbb{E}_{z \sim q(z|x)}[p(x|z) * \frac{p(z)}{q(z|x)}] \\ & \approx \frac{1}{N} \sum_{i = 1}^N \log \mathbb{E}_{z \sim q(z|x_i)}[p(x_i|z) * \frac{p(z)}{q(z|x_i)}] \\ & \approx \frac{1}{N} \sum_{i = 1}^N \log \frac{1}{M}\sum_{j=1}^{M} [p(x_i|z_j) * \frac{p(z_j)}{q(z_j|x_i)}], \qquad z_j \sim q(z|x_i) \end{align} $$

We can approximate the outer expectation over the data by drawing N samples from the dataset and averaging the results. We do the same for the inner expectation and just assume that this is a correct approximation. However, if we for example draw the same data point multiple times from the outer expectation but only a single sample from the inner one, we have actually approximated the sum of log probabilities (i.e. the result of applying Jensen's inequality) instead of the log of a sum of probabilities.
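To make the difference concrete, here is a minimal numpy sketch. The conjugate Gaussian model in it is a hypothetical choice of my own (not taken from any of the papers above); it just makes $\log p(x)$ available in closed form so both estimators can be compared against it:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy conjugate model (a hypothetical choice, only to illustrate the two estimators):
#   p(z)     = N(0, 1)
#   p(x | z) = N(z, 1)      =>  p(x) = N(0, 2), so log p(x) is known in closed form
#   q(z | x) = N(x/2, 1)    deliberately NOT the exact posterior N(x/2, 1/2)

x = 1.5        # a single data point x_i
M = 100_000    # number of inner samples z_j ~ q(z | x)

z = rng.normal(loc=x / 2, scale=1.0, size=M)
log_w = (norm.logpdf(x, loc=z, scale=1.0)            # log p(x | z_j)
         + norm.logpdf(z, loc=0.0, scale=1.0)        # log p(z_j)
         - norm.logpdf(z, loc=x / 2, scale=1.0))     # log q(z_j | x)

# Estimator 1: log of the mean of the weights (what the question proposes).
# As M -> infinity this converges to log p(x); in practice use logsumexp for stability.
log_of_mean = np.log(np.mean(np.exp(log_w)))

# Estimator 2: mean of the log weights (the ELBO); by Jensen's inequality it is lower.
mean_of_log = np.mean(log_w)

exact = norm.logpdf(x, loc=0.0, scale=np.sqrt(2.0))
print(f"exact log p(x)       : {exact:.4f}")
print(f"log of mean (M={M})  : {log_of_mean:.4f}")
print(f"mean of log (ELBO)   : {mean_of_log:.4f}  <= log p(x)")
```

With M = 1 the first estimator degenerates into the second one, which is exactly the situation described above.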

EDIT1: If we increase M to infinity, we recover the original formulation. This should also be what is stated in equation 10 of Burda et al., 2015 (https://arxiv.org/pdf/1509.00519.pdf). At least these are my thoughts; maybe someone can show this formally instead of intuitively, as I did.
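For reference, I believe the multi-sample bound they analyze has the form

$$ \mathcal{L}_M = \mathbb{E}_{z_1,\dots,z_M \sim q(z|x)}\left[\log \frac{1}{M}\sum_{j=1}^{M} \frac{p(x|z_j)\,p(z_j)}{q(z_j|x)}\right] \leq \log p(x), $$

with $\mathcal{L}_M \to \log p(x)$ as $M \to \infty$ (under mild conditions on the weights), which matches the intuition above.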

EDIT2: I think the inverse of Proof 1.3 in Appendix A of Burda et al., 2015 is what my original question was aiming for. So if we start from the expectations and approximate the expectation inside the log by a sample mean, the result must be a lower bound on the original $\log p(x)$, since Jensen's inequality applies (see the sketch below). I will add this as an answer to this thread soon.
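Sketched out, with $w_j = \frac{p(x|z_j)\,p(z_j)}{q(z_j|x)}$ and $z_1,\dots,z_M \sim q(z|x)$ i.i.d., the step is just

$$ \mathbb{E}_{z_1,\dots,z_M}\left[\log \frac{1}{M}\sum_{j=1}^{M} w_j\right] \leq \log \mathbb{E}_{z_1,\dots,z_M}\left[\frac{1}{M}\sum_{j=1}^{M} w_j\right] = \log \mathbb{E}_{z \sim q(z|x)}[w] = \log p(x), $$

i.e. Jensen's inequality applied to the finite-sample estimator rather than to the exact inner expectation.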

Tim Joseph

1 Answer


Is there a reason why we need to apply Jensen's inequality here? Can't we just approximate the term by sampling, as it is?

There are two possibilities for what you’re asking here. I’ll address both.

  1. Why do we need Jensen’s inequality?
    To ensure that this is in fact a bound. If the optimization objective weren’t a bound, then there wouldn’t be much point in optimizing it. Speaking loosely, think of lifting a handful of sand. If it’s not a lower bound, sand slips through the gaps between your fingers.
  2. Why do we need to bound the likelihood (using Jensen’s inequality) instead of optimizing it directly by a Monte Carlo method?
    Well, we assume that the joint probability factors as $p(x,z)=p(z)p(x\mid z)$. The posterior $p(z \mid x)$ is the (computationally) hard part. In general, we can't assume that we can normalize it. But maybe we can sample from it? In fact, what you're proposing is common! MCMC is often used to approximate the posterior. But if there's a large amount of data or a complex model, that becomes very slow. That's when people switch to VI.
  • Thank you for your answer. However, 1. does not really make sense to me. There is no intrinsic reason why we need a bound; the first sentence of 2. kind of shows this. Regarding 2., the hard part is solved by importance sampling with $q(z|x)$ when marginalizing. So what makes Jensen's inequality and the bound actually necessary? – Tim Joseph May 11 '22 at 15:38
  • I think I now understand what you mean by 1.: $q$ is not learned because, by definition, it does not influence the result. Importance sampling gives us the idea that the posterior is the ideal sampler and thus $q$ should approximate it, but this does not give an intuition for Jensen's inequality. – Tim Joseph May 11 '22 at 16:54
  • So in other words: I understand why Jensen's inequality can be used to derive the bound, but it seems like there is no intuition for why one should do this. Imagine the first person (pre variational theory) who got this far and then thought, "now I should apply Jensen's inequality". What was his/her intuition? – Tim Joseph May 11 '22 at 17:41
  • You seem to be asking how advances in mathematics are made, more broadly. In what way is this different from the general question of how to approach a new proof? – Arya McCarthy May 11 '22 at 19:04
  • Not exactly. I am asking how this advance was made, or even less: whether the application of Jensen's inequality is merely the result of reshaping another derivation of the ELBO that does have an intuition. For example, the derivation via the ideal sampler in importance sampling makes sense: 1. we want $p(x)$ -> 2. we find that plain Monte Carlo sampling is slow -> 3. we use importance sampling -> 4. we search for the ideal sampler -> 5. the ideal sampler is $p(z|x)$ -> 6. it is intractable -> 7. we use the KL divergence to approximate it. For Jensen's I only follow up to step 3 and don't know what comes next. – Tim Joseph May 11 '22 at 20:17
  • In this lecture https://www.youtube.com/watch?v=h0UE8FzdE8U&list=WL&index=1 (Variational Inference and Deep Learning: An Intuitive Introduction), Alex Lamb talks about a problem with the expectation with regard to multiple samples. Unfortunately, I was not able to understand from the talk what the problem actually is, but it is exactly what I am asking about. – Tim Joseph May 12 '22 at 15:26