
Consider a simple Autoencoder neural net:

    from torch import nn

    class AE(nn.Module):
        def __init__(self, x_dim, z_dim, h_dim=42):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(x_dim, h_dim),
                nn.ReLU(),
                nn.Linear(h_dim, z_dim),
            )
            self.decoder = nn.Sequential(
                nn.Linear(z_dim, h_dim),
                nn.ReLU(),
                nn.Linear(h_dim, x_dim),
            )

        def forward(self, x):
            z = self.encoder(x)
            x = self.decoder(z)
            return x

In popular literature, it is generally implied that AE.encoder is solely responsible for the encoding, whereas AE.decoder is solely responsible for the decoding. Why is that?

If we consider that encoding is a more complex task than decoding, there is no actual guarantee that the network won't use the first three layers for encoding and only the last one for decoding (or vice versa). This might especially be the case for asymmetrical autoencoder architectures.

  • Think of both as nonlinear transformations, i.e. mappings from one mathematical space to another. It is true that for some problems the decoder is more complex. – patagonicus Jul 10 '22 at 02:55

2 Answers


Computationally, there's not really any difference between the processing that happens in the encoder and in the decoder. Each layer maps its input to an output, which (usually) has a different number of dimensions, using a linear transformation followed by a non-linear transformation (the activation function), as @msuzen stated in the comment. Usually the first few layers reduce the dimensionality of the input and the last few layers increase it again. So we think of the smallest representation of the data as the encoded data, the layers before it as the encoder, and the layers after it as the decoder. But these are just labels we give to the different parts of the model.
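To make this concrete, here is a minimal sketch (assuming the AE class from the question and made-up sizes for illustration) that prints the output shape of every layer; the "encoding" is simply the activation with the fewest dimensions, sitting at the boundary between the two nn.Sequential blocks:

    import torch

    model = AE(x_dim=784, z_dim=8)   # hypothetical dimensions, for illustration only
    h = torch.randn(1, 784)          # a single fake input vector

    # Walk through all layers of encoder and decoder and report each output shape.
    for i, layer in enumerate(list(model.encoder) + list(model.decoder)):
        h = layer(h)
        print(i, layer.__class__.__name__, tuple(h.shape))
    # The smallest shape, (1, 8), appears exactly where the encoder ends and the
    # decoder begins - that boundary is what we label the "code" or bottleneck.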

Lynn

You are correct, the code

        self.encoder = nn.Sequential(
            nn.Linear(x_dim, h_dim),
            nn.ReLU(),
            nn.Linear(h_dim, z_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(z_dim, h_dim),
            nn.ReLU(),
            nn.Linear(h_dim, x_dim),
        )

could just as well be written as

        self.network = nn.Sequential(
            nn.Linear(x_dim, h_dim),
            nn.ReLU(),
            nn.Linear(h_dim, z_dim),
            nn.Linear(z_dim, h_dim),
            nn.ReLU(),
            nn.Linear(h_dim, x_dim),
        )

while working exactly the same. The point of autoencoders is, however, to create an encoded representation of the data. An autoencoder achieves this by training a network that, given the input data, can re-create the same data as its output. The most trivial such function would be the identity function; the catch is that autoencoders create an intermediate representation of the data, the encoding, that usually has lower dimensionality than the original data and hence compresses it. That is why autoencoders, when visualized, look like an hourglass, with layers that first get narrower (the encoder) and then wider again (the decoder), as in the diagram below (image source).

Diagram showing a typical autoencoder.
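To illustrate what "trained to re-create its own input" means in practice, here is a minimal training sketch; the optimizer, learning rate, and random data are placeholders for whatever your real setup would be, and the key point is that the reconstruction loss compares the network's output directly against its own input:

    import torch
    from torch import nn, optim

    model = AE(x_dim=784, z_dim=8)                 # the AE class from the question
    opt = optim.Adam(model.parameters(), lr=1e-3)  # hypothetical hyperparameters
    loss_fn = nn.MSELoss()

    data = torch.randn(256, 784)                   # fake data, just to make this runnable

    for epoch in range(10):
        opt.zero_grad()
        recon = model(data)          # forward pass: encode, then decode
        loss = loss_fn(recon, data)  # reconstruction error against the input itself
        loss.backward()
        opt.step()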

So you are correct that any layer of the network holds some kind of representation of the data, but only the bottleneck layer holds the most compressed representation, which is the one we are after.

However, the bottleneck is not really the point here; the point is to have two networks: an encoder that creates a representation of the data and a decoder that translates it back to the original data. The encoding does not have to have lower dimensionality than the data (though in most cases such a representation wouldn't be useful), nor does there have to be any symmetry between the encoder and the decoder. What matters is that the produced encoding is useful to us. You are correct that a similar thing can often be achieved by extracting the representation of the data created in earlier layers of a network trained for a completely different purpose, as we do when re-using pre-trained networks, e.g. using GloVe or word2vec for feature extraction.
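As an illustration of "the encoding being useful", once such a model is trained you would typically call only its encoder half to turn raw inputs into features for a downstream task. A minimal sketch, assuming the AE class from the question and a made-up downstream classifier:

    import torch
    from torch import nn

    model = AE(x_dim=784, z_dim=8)     # an (ideally trained) autoencoder
    data = torch.randn(256, 784)       # stand-in for real inputs

    # Use only the encoder half as a feature extractor; the decoder is simply not called.
    with torch.no_grad():
        codes = model.encoder(data)    # shape (256, 8): the learned representation

    # The codes can now feed any downstream model, e.g. a small linear classifier.
    classifier = nn.Linear(8, 2)       # hypothetical 2-class downstream task
    logits = classifier(codes)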

Tim