Is MSE loss a valid ELBO loss to measure?

Question

I am learning from an example given by TensorFlow document, https://www.tensorflow.org/tutorials/generative/cvae#define_the_loss_function_and_the_optimizer:

VAEs train by maximizing the evidence lower bound (ELBO) on the marginal log-likelihood.

In practice, optimize the single sample Monte Carlo estimate of this expectation: logp(x|z) + logp(z) - logq(z|x).

The loss function was implemented as:

def log_normal_pdf(sample, mean, logvar, raxis=1):
  log2pi = tf.math.log(2. * np.pi)
  return tf.reduce_sum(
      -.5 * ((sample - mean) ** 2. * tf.exp(-logvar) + logvar + log2pi),
      axis=raxis)
def compute_loss(model, x):
  mean, logvar = model.encode(x)
  z = model.reparameterize(mean, logvar)
  x_logit = model.decode(z)
  cross_ent = tf.nn.sigmoid_cross_entropy_with_logits(logits=x_logit, labels=x)
  logpx_z = -tf.reduce_sum(cross_ent, axis=[1, 2, 3])
  logpz = log_normal_pdf(z, 0., 0.)
  logqz_x = log_normal_pdf(z, mean, logvar)
  return -tf.reduce_mean(logpx_z + logpz - logqz_x)

Since this example used MINIST dataset, x can be normalized to [0, 1] and sigmoid_cross_entropy_with_logits was used here.

My question:

Another example used MSE loss (as follow), is MSE loss a valid ELBO loss to measure p(x|z)? Can we use other loss functions as a reconstruction loss in VAE, such as Huber loss (https://en.wikipedia.org/wiki/Huber_loss)?

https://www.tensorflow.org/guide/keras/custom_layers_and_models#putting_it_all_together_an_end-to-end_example

    # Iterate over the batches of the dataset.
    for step, x_batch_train in enumerate(train_dataset):
        with tf.GradientTape() as tape:
            reconstructed = vae(x_batch_train)
            # Compute reconstruction loss
            loss = mse_loss_fn(x_batch_train, reconstructed)
            loss += sum(vae.losses)  # Add KLD regularization loss

Sycorax · Accepted Answer · 2022-06-30T17:19:13.227

The Kingma et al. paper is very readable, and a good place to start understanding how and why VAEs work. Kingma, Diederik P., and Max Welling. "Auto-encoding variational Bayes." arXiv preprint arXiv:1312.6114 (2013).

"Another example used MSE loss (as follow), is MSE loss a valid ELBO loss to measure p(x|z)?"

Yes, MSE is a valid ELBO loss; it's one of the examples used in the paper. the authors write

We let $p_\theta(\bf{x}|\bf{z})$ be a multivariate Gaussian (in case of real-valued data) or Bernoulli (in case of binary data) whose distribution parameters are computed from $\bf{z}$ with a MLP (a fully-connected neural network with a single hidden layer, see appendix C).

In other words, we can use any $p_\theta(x|z)$ we like; we just need to implement a network to decode $z$ into $x$ and then measure the loss according to $p_\theta$.

Simple manipulations shows that minimizing MSE loss is the same as maximizing the joint probability of the gaussian density wrt the mean parameter. See: How do we get to the MSE in the loss function for a variational autoencoder?

"Can we use other loss functions as a reconstruction loss in VAE, such as Huber loss?"

Yes, choosing the Huber loss corresponds to replacing $p_\theta(\bf{x}|\bf{z})$ with another density, specifically the Huber density. For the Huber loss given by

$$ H_\alpha(x) = \begin{cases} \frac{1}{2} x^2 & | x | \le \alpha \\ \alpha \left(|x| - \frac{1}{2}\alpha \right) & | x | > \alpha \end{cases} $$ we can work backwards from the corresponding likelihood to show that the probability density implied by the Huber loss is given by

$$ p_\theta(y) \propto \exp \left(-H_\alpha(y)\right). $$

But knowing this fact isn't strictly necessary from a practical standpoint -- you can simply replace MSE with the Huber loss.

all feedback welcome · Answer 2 · 2022-06-30T17:16:24.740

As you point out, the ELBO is approximated by $\log p(x|z) + \log p(z) - \log q(z|x)$ with reparameterized samples from $q(z|x)$.

The first term, $\log p(x|z)$, is, in expectation over $q(z|x)$, often called the likelihood loss (reconstruction loss is a synonym in the context of VAEs). Such a loss, but without the expectation over $q(z|x)$, appears also in neural networks without latent variables $z$, in the form of $\log p(y|x)$, e.g. in classification/regression.

We have freedom over the choice of the modeled distribution $p(x|z)$, and depending on that distribution get a different loss function. The likelihood loss for a Bernoulli output distribution (per pixel) is sigmoid_cross_entropy_with_logits and for a Gaussian output distribution (per pixel) is the MSE loss.

The latter is because (up to an additive and a multiplicative constant) the quadratic error equals the log of the Gaussian likelihood/density. Similarly for other loss functions, also for the Huber loss, you can find a distribution $p(x|z)$ that corresponds to it. It is therefore completely reasonable to use any such loss functions.

However, the loss function should fit the output domain. If it's discrete, you shouldn't use a continuous loss function and vice versa. But with the Huber loss for continuous values, you are on the right track.

It's also not very elegant to use an infinite range output distribution / corresponding loss if the output is on a finite range like pixel values, but since all models are wrong to some extend anyway, no need to bother I guess, if your library doesn't offer something for finite ranges. Note that if your values are between 0 and 255, it's always possible to normalize them if that is convenient.

Is MSE loss a valid ELBO loss to measure?

My question:

2 Answers2