
I have built the following function, which takes some data as input and runs a VAE on it:

import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import backend as K
from sklearn.model_selection import train_test_split

def VAE(data, original_dim, latent_dim, test_size, epochs):
    x_train, x_test = train_test_split(data, test_size=test_size, random_state=42)

    # Define the VAE architecture
    # Encoder
    encoder_inputs = tf.keras.Input(shape=(original_dim,))
    x = layers.Dense(64, activation='relu')(encoder_inputs)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dense(8, activation='relu')(x)

    # --- Custom latent space layers
    z_mean = layers.Dense(units=latent_dim, name='Z-Mean', activation='linear')(x)
    z_log_sigma = layers.Dense(units=latent_dim, name='Z-Log-Sigma', activation='linear')(x)
    z = layers.Lambda(sampling, name='Z-Sampling-Layer')([z_mean, z_log_sigma, latent_dim])  # Z sampling layer

    # Instantiate the encoder
    encoder = tf.keras.Model(encoder_inputs, [z_mean, z_log_sigma, z], name='encoder')

    # Decoder
    latent_inputs = tf.keras.Input(shape=(latent_dim,))
    x = layers.Dense(8, activation='relu')(latent_inputs)
    x = layers.Dense(32, activation='relu')(x)
    x = layers.Dense(64, activation='relu')(x)
    decoder_outputs = layers.Dense(1, activation='relu')(x)

    # Instantiate the decoder
    decoder = tf.keras.Model(latent_inputs, decoder_outputs, name='decoder')

    # Instantiate a VAE model by specifying how the encoder and decoder are linked
    vae = tf.keras.Model(inputs=encoder_inputs, outputs=decoder(encoder(encoder_inputs)[2]), name='vae')

    # Reconstruction loss compares inputs and outputs and tries to minimise the difference
    r_loss = original_dim * tf.keras.losses.mse(encoder_inputs, decoder(encoder(encoder_inputs)[2]))  # use MSE

    # KL divergence loss compares the encoded latent distribution Z with a standard normal
    # distribution and penalises it if it is too different
    kl_loss = -0.5 * K.mean(1 + z_log_sigma - K.square(z_mean) - K.exp(z_log_sigma), axis=-1)

    # VAE total loss
    vae_loss = K.mean(r_loss + kl_loss)

    # Add the loss to the model and compile it
    vae.add_loss(vae_loss)
    vae.compile(optimizer='adam')

    # Train the model
    vae.fit(x_train, x_train, epochs=epochs, validation_data=(x_test, x_test))

where

def sampling(args):
    z_mean, z_log_sigma, latent_dim = args
    epsilon = K.random_normal(shape=(K.shape(z_mean)[0], latent_dim), mean=0., stddev=1., seed=42)
    return z_mean + K.exp(z_log_sigma) * epsilon

My question is: if I want to generate new data using the above VAE, how can I achieve that?

If I want to sample 100 new data points, should I use this

   latent_mean = tf.math.reduce_mean(encoder(x_train)[2], axis=0) 
   latent_std = tf.math.reduce_std(encoder(x_train)[2], axis=0)
   tf.random.normal(shape=(100, latent_dim), mean=latent_mean, stddev=latent_std)

or

   latent_mean = tf.math.reduce_mean(encoder(x_train)[0], axis=0) 
   latent_std = tf.math.exp(tf.math.reduce_mean(encoder(x_train)[1], axis=0))
   tf.random.normal(shape=(100, latent_dim), mean=latent_mean, stddev=latent_std)

?

Basically, should I use z_mean and z_log_sigma directly, or should I infer them from z?

  • The outputs of normal will be the same when the inputs are the same. But we have good reason to expect z_log_sigma to be on the log scale (your code uses exponentiation: K.exp(z_log_sigma)), and therefore different than np.std(z). – Sycorax Jan 20 '23 at 14:30
  • You are right, that was a typo. I edited my question – quant Jan 20 '23 at 14:52
  • The question remains though – quant Jan 20 '23 at 14:53
  • I don't see any evidence of an edit (there's no edit history). But even if you switch normal(mean=z_mean, stddev=z_log_sigma) to normal(mean=z_mean, stddev=np.exp(z_log_sigma)), the output will still be different from normal(mean=np.mean(z), stddev=np.std(z)) because the two calls have different inputs: the latter has scalar mean and standard deviation, but the former has an array. The documentation explains the distinction. – Sycorax Jan 20 '23 at 15:12
  • I see the confusion. I tried to oversimplify the question by removing the dimensions. I removed the confusing part. The bottom line of the question is whether I should use z_mean and z_log_sigma directly, or whether I should infer them from Z. I do not understand the difference between these two. – quant Jan 20 '23 at 15:18

1 Answer


At a high level of abstraction, there are three steps to VAEs. (The most common use-case of a VAE assumes that the $z$ are distributed as a multivariate normal distribution with a diagonal covariance matrix, so I focus on that.)

  1. Encode. Use a neural network $f$ to map the input $x$ to a vector of means and a vector of standard deviations: $f(x) = (\mu, \sigma)$.
  2. Sample. Draw a vector from the multivariate normal distribution defined by $(\mu, \sigma)$. This is $z \sim \mathcal{N}(\mu, \sigma^2 I)$.
  3. Decode. Use a neural network to map the vector $z$ to something that "looks like" the input $x$. We could write this as $\hat x = g(z)$.
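
In terms of the encoder and decoder defined in your code, those three steps look roughly like this (a minimal sketch; encoder, decoder and a batch of inputs x are assumed to come from the question):

    # 1. Encode: f(x) = (mu, log sigma); the encoder also returns a sampled z
    z_mean, z_log_sigma, z = encoder(x)

    # 2. Sample: inside the encoder, the Lambda layer draws
    #    z = mu + exp(log sigma) * epsilon, with epsilon ~ N(0, I)

    # 3. Decode: map z back to something that "looks like" x
    x_hat = decoder(z)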

During training, the loss is computed as the sum of the KL divergence and the reconstruction error: $\| \hat x - x \|_2^2 + \mathrm{KLD}$, for example.
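
For the diagonal-Gaussian case above, that KL term has a closed form: $D_{\mathrm{KL}}\big(\mathcal{N}(\mu, \sigma^2 I) \,\|\, \mathcal{N}(0, I)\big) = -\tfrac{1}{2} \sum_{j} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$. Up to averaging instead of summing over the latent dimensions, this is what the kl_loss line in your code computes, if z_log_sigma is interpreted as the log variance.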

If you're trying to generate new data that "looks like" the input data, then you want $\hat x$. So neither of your proposals is correct -- you want to decode $z$.
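
For example, here is a minimal sketch of generating 100 new samples with the models from your function (assuming encoder, decoder and x_train are in scope): draw $z$ using the z_mean and exp(z_log_sigma) returned by the encoder, then pass it through the decoder.

    import tensorflow as tf

    # Posterior parameters for a batch of inputs
    z_mean, z_log_sigma, _ = encoder(x_train[:100])

    # Draw z = mu + exp(log sigma) * epsilon, with epsilon ~ N(0, I)
    epsilon = tf.random.normal(shape=tf.shape(z_mean))
    z = z_mean + tf.exp(z_log_sigma) * epsilon

    # Decode z to obtain new data that "looks like" the input
    generated = decoder(z)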

Sycorax
  • If I understand correctly, then I should do the following: latent_sample = tf.random.normal(shape=(100, latent_dim), mean=latent_mean, stddev=latent_std) and then generated_data = decoder(latent_sample), right? But then which latent_mean and latent_std should I use? – quant Jan 20 '23 at 15:49
  • To draw the latent variable $z$, you use the mean and standard deviation obtained from the encoder, z_mean and exp(z_log_sigma). – Sycorax Jan 20 '23 at 15:56
  • I do not understand the difference between tf.math.reduce_mean(encoder(x_train)[2], axis=0) (the mean of the latent vector z) and tf.math.reduce_mean(encoder(x_train)[0], axis=0) (the mean of the encoder's z_mean output).

    I mean, isn't the distribution of $z$ as close as possible to the distribution of the input data, due to the KL divergence? So, shouldn't these two be similar? I guess I am missing something here.

    – quant Jan 20 '23 at 16:03
  • $z$ is the latent variable, an encoding of the input, but there's no requirement that the encoding has to be close to $x$. The whole point of the decoder is to transform $z$ into $\hat x$, and the reconstruction loss function tells you how close the reconstruction $\hat x$ is to the input $x$. You can make the KL divergence exactly 0 if you use a $z$ that has nothing to do with the input, but does exactly match your desired distribution. The VAE loss is KLD plus reconstruction loss. Reconstruction loss is usually something like $\| \hat x - x \|_2^2$. – Sycorax Jan 20 '23 at 16:11
  • But isn't the desired distribution the distribution of the original data? I am confused. – quant Jan 20 '23 at 23:13
  • It seems that you do not have a strong understanding of VAEs. This is a good thread about what VAEs are and how they work: https://stats.stackexchange.com/questions/321841/what-are-variational-autoencoders-and-to-what-learning-tasks-are-they-used/328181#328181 – Sycorax Jan 20 '23 at 23:18