This is the objective function of a variational autoencoder. I am not sure how to interpret the second term. It appears to be an expectation of $\log p(\mathbf{x}^{(i)} | \mathbf{z})$, but I'm not sure what role the subscript $q(\mathbf{z} | \mathbf{x}^{(i)})$ plays here. Thanks in advance.
See my question here. – mhdadk Mar 14 '21 at 17:16
1 Answer
It means expectation with respect to $q_{\phi}(\mathbf{z} | \mathbf{x}^{(i)})$. So:
$$\mathbb{E}_{q_{\phi}(\mathbf{z} | \mathbf{x}^{(i)})}[\log p_{\theta}(\mathbf{x}^{(i)} | \mathbf{z})] = \int_{\mathbb{R}^d} q_{\phi}(\mathbf{z} | \mathbf{x}^{(i)}) \log p_{\theta}(\mathbf{x}^{(i)} | \mathbf{z}) d \mathbf{z} $$
where, without further information on the dimensionality of $\mathbf{z}$, I have assumed it lies in $\mathbb{R}^d$.
To clarify further: the underlying random vector, i.e. the source of randomness, is $\mathbf{z}$, and you are computing the expectation of a function $f(\mathbf{z})$ of it, where $f(\mathbf{z}) = \log p_{\theta}(\mathbf{x}^{(i)} | \mathbf{z})$. This underlying source of randomness is captured by the distribution $q_{\phi}(\mathbf{z} | \mathbf{x}^{(i)})$.
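If it helps to see this concretely, the expectation can be approximated by Monte Carlo sampling: draw $\mathbf{z}$ from $q_{\phi}(\mathbf{z} | \mathbf{x}^{(i)})$ many times and average $f(\mathbf{z})$. Here is a minimal sketch in Python/NumPy, assuming a diagonal-Gaussian $q$ and a toy Gaussian decoder $p_{\theta}(\mathbf{x} | \mathbf{z}) = \mathcal{N}(\mathbf{x}; W\mathbf{z}, I)$; the matrix `W` (standing in for $\theta$) and the parameters `mu`, `sigma` (standing in for $\phi$) are my own illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 3                              # latent dim, data dim
x_i = np.array([0.5, -1.0, 2.0])         # one observed data point x^(i)

# Variational parameters phi: q(z | x^(i)) = N(mu, diag(sigma^2))
mu, sigma = np.zeros(d), np.ones(d)

# Decoder parameters theta: p(x | z) = N(x; W z, I)
W = rng.normal(size=(n, d))

def log_p_x_given_z(x, z):
    """log N(x; W z, I) -- the deterministic function f(z), for fixed x."""
    r = x - W @ z
    return -0.5 * (n * np.log(2 * np.pi) + r @ r)

# Monte Carlo estimate of E_{q(z|x^(i))}[log p(x^(i)|z)]
S = 10_000
z_samples = mu + sigma * rng.normal(size=(S, d))   # draws from q(z | x^(i))
estimate = np.mean([log_p_x_given_z(x_i, z) for z in z_samples])
print(estimate)
```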
Addressing comments.
In response to:
The part that confuses me is that in $f(\mathbf{z})$, $\mathbf{z}$ plays the role of a condition, which seems like it is fixed?
As a disclaimer, I've not yet read the paper "Auto-encoding variational Bayes" by Kingma and Welling in sufficient depth, so I cannot supply context-specific interpretations (e.g. what the encoder and decoder are). However, that may not be necessary at this stage, as I suspect the issue is not contextual.
I used the notation $f(\mathbf{z})$ without including the other arguments purely to indicate where the source of randomness comes from; there are other arguments of $f$ which I've neglected to mention.
'Decompressing' what is inside the expectation:
$$\log p_{\theta}(\mathbf{x}^{(i)} | \mathbf{z}) = \log p(\mathbf{x}^{(i)} | \mathbf{z}; \theta) = \log p(\mathbf{x} | \mathbf{z}; \theta) \left. \right|_{\mathbf{x} = \mathbf{x}^{(i)}}$$
where $\left. \right|_{\mathbf{x} = \mathbf{x}^{(i)}}$ means 'evaluated at $\mathbf{x} = \mathbf{x}^{(i)}$', and I have used the semicolon to indicate that the fixed but unknown parameter $\theta$ parametrises the conditional distribution and is not being treated as a random variable (at least in the first part of the paper).
Consider the function $f(\mathbf{x}, \mathbf{z}, \theta) = \log p(\mathbf{x} | \mathbf{z}; \theta)$.
Now $f$ is a completely deterministic function: if I input a value of the observed data $\mathbf{x} = \mathbf{x}^{(i)}$, a value of the latent variable $\mathbf{z} = \mathbf{z}^{(i)}$, and a value of the parameter $\theta = \theta_0$, it will output a fixed number, namely the log conditional density of observing $\mathbf{x}^{(i)}$, given that the latent variable is observed to be $\mathbf{z}^{(i)}$, for the particular parameter value $\theta_0$.
Now consider the case where we fix the random variables $\mathbf{x} = \mathbf{x}^{(i)}$ and $\mathbf{z} = \mathbf{z}^{(i)}$, but where we don't know $\theta$. In this case, $f(\mathbf{x}^{(i)}, \mathbf{z}^{(i)}, \theta)$ can vary freely only in $\theta$, and is still deterministic. This is because we have fixed the random variable $\mathbf{x}$ using the observed data $\mathbf{x}^{(i)}$, and fixed the random variable $\mathbf{z}$ by conditioning on an observation of the latent variable, $\mathbf{z}^{(i)}$. I suspect this is what you might be thinking of when you say "$\mathbf{z}$ plays the role of a condition which seems like it is fixed". This is not the situation we are in.
The situation we are in is $f(\mathbf{x}^{(i)}, \mathbf{z}, \theta) = \log p(\mathbf{x}^{(i)} | \mathbf{z}; \theta)$. Here, the training data, having been observed, is fixed at $\mathbf{x} = \mathbf{x}^{(i)}$, but $f(\mathbf{x}^{(i)}, \mathbf{z}, \theta)$ can now vary freely both in the parameter $\theta$ and in the latent variable $\mathbf{z}$. Additionally, the output of this function is now random, due solely to one of its inputs, the latent variable $\mathbf{z}$, being random. Hence the key point is that we are not conditioning on an observation $\mathbf{z}^{(i)}$ of the latent variable; rather, we are conditioning on the latent random variable $\mathbf{z}$ itself.
In other words, the distinction I believe you are overlooking is between conditioning on a random variable and conditioning on an observed value of that random variable.
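To make the fixed-versus-random distinction concrete, here is a self-contained sketch using the same hypothetical Gaussian decoder as above (again my own toy construction, not from the paper). With all three arguments fixed, $f$ returns the same number every time; once $\mathbf{z}$ is drawn from $q$, the output becomes a random quantity:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 3
x_i = np.array([0.5, -1.0, 2.0])          # observed data point x^(i): fixed
theta_W = rng.normal(size=(n, d))          # a particular parameter value theta_0

def f(x, z, W):
    """f(x, z, theta) = log p(x | z; theta) for a toy Gaussian decoder N(x; W z, I)."""
    r = x - W @ z
    return -0.5 * (len(x) * np.log(2 * np.pi) + r @ r)

# Case 1: x, z, and theta all fixed -> f is deterministic.
z_i = np.array([0.3, 0.7])                 # an *observed* latent value z^(i)
print(f(x_i, z_i, theta_W), f(x_i, z_i, theta_W))   # identical outputs

# Case 2: x and theta fixed, z random -> the output of f is random.
for _ in range(3):
    z = rng.normal(size=d)                 # one draw of the latent random variable z
    print(f(x_i, z, theta_W))              # a different value on each draw
```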
Now, when you take the expectation with respect to $q(\mathbf{z} | \mathbf{x}^{(i)}; \phi)$, you are 'averaging out' the randomness of the unknown latent variable $\mathbf{z}$ altogether. That is, computing the expectation
$$\begin{align} \mathbb{E}_{q(\mathbf{z} | \mathbf{x}^{(i)}; \phi)}[\log p_{\theta}(\mathbf{x}^{(i)} | \mathbf{z})] &= \int_{\mathbb{R}^d} q(\mathbf{z} | \mathbf{x}^{(i)}; \phi) \log p(\mathbf{x}^{(i)} | \mathbf{z}; \theta) d \mathbf{z} \\ &= h(\phi, \theta) \end{align}$$
will give you a deterministic function $h$ that can vary only in the model parameter $\theta$ and the variational parameter $\phi$, both of which, in the initial parts of the paper, are treated not as random variables but as global parameters we'd like to estimate.
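As a hand-checkable illustration (a one-dimensional toy example of my own, not from the paper): take $q(z | x^{(i)}; \phi) = \mathcal{N}(z; \mu, \sigma^2)$ with $\phi = (\mu, \sigma)$, and a decoder with no free parameters, $p(x | z) = \mathcal{N}(x; z, 1)$. Then
$$\begin{align} h(\phi, \theta) &= \mathbb{E}_{q}\left[\log p(x^{(i)} | z)\right] \\ &= -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\,\mathbb{E}_{q}\left[(x^{(i)} - z)^2\right] \\ &= -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\left[(x^{(i)} - \mu)^2 + \sigma^2\right], \end{align}$$
which, once the data point $x^{(i)}$ is plugged in, is an ordinary deterministic function of the variational parameters alone (with a parametrised decoder mean it would depend on $\theta$ too): all the randomness in $z$ has been integrated out.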
As a sanity check, if you go back to the main equation, note that the evidence lower bound evaluated at training data point $i$ is denoted $\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$, and that, as the notation indicates, it can freely vary only in $\theta$ and $\phi$.
Thank you for the reply. The part that confuses me is that in $f(\mathbf{z})$, $\mathbf{z}$ plays the role of a condition, which seems like it is fixed? – Sam Mar 15 '21 at 04:15