
I am hand-coding a transformer (https://arxiv.org/pdf/1706.03762.pdf) based primarily on the instructions I found at this blog: http://jalammar.github.io/illustrated-transformer/.

The first attention block takes a matrix input of shape [words, input dimension] and multiplies it by the attention weight matrices of shape [input dimension, model dimension]. The model dimension is chosen to be smaller than the input dimension and is used as the output dimension in all subsequent steps.

There is a residual connection around the attention block, and the input is meant to be added to the output of the attention block. However, the output of the attention block has shape [words, model dimension] while the input has shape [words, input dimension]. Should I interpolate the input down to the model dimension, as is done in ResNet? Or should I add another weight matrix to transform the input?
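To make the mismatch concrete, here is a minimal NumPy sketch of the setup described above (all sizes are hypothetical, and only a single projection is shown):

```python
import numpy as np

words, d_input, d_model = 4, 512, 64   # hypothetical sizes

X = np.random.randn(words, d_input)     # input: [words, input dimension]
W = np.random.randn(d_input, d_model)   # attention weight matrix
out = X @ W                             # [words, model dimension]

# The residual add X + out would fail here, because X is (4, 512)
# while out is (4, 64) -- this is the shape conflict in question.
print(X.shape, out.shape)
```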


mkohler

1 Answer


The input dimensionality is the embedding size, which is the same as the model dimensionality, as explained in section 3.4 of the article:

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$.

Therefore, the input dimensionality and the model dimensionality are the same, which makes them suitable for the residual connection.
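As a sketch (hypothetical sizes, a single attention head for brevity), scaled dot-product attention over inputs of dimension $d_{model}$ produces outputs of the same dimension, so the residual add is well-defined:

```python
import numpy as np

words, d_model = 4, 512   # input dim == model dim, per section 3.4

X = np.random.randn(words, d_model)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attn_out = weights @ V                          # [words, d_model]

residual = X + attn_out   # shapes match: no interpolation or extra projection needed
```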

noe
  • Thank you for pointing this out. Indeed input dim = model dim. The concept I was missing was that the attention matrices can have a smaller dimension than the model dim. However, after the heads are concatenated, they are multiplied by a matrix to restore the original model dim. – mkohler Jan 26 '21 at 22:26
  • I suggest taking a look at this answer for details on that matter. – noe Jan 26 '21 at 23:34
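The point raised in the comments above can be sketched as follows: each head projects down to a smaller dimension $d_k = d_{model}/h$, the head outputs are concatenated, and a final matrix $W^O$ maps the result back to $d_{model}$ (hypothetical sizes, loop over heads for clarity):

```python
import numpy as np

words, d_model, h = 4, 512, 8
d_k = d_model // h            # per-head dimension (64), smaller than d_model

X = np.random.randn(words, d_model)

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # [words, d_k]

concat = np.concatenate(heads, axis=-1)   # [words, h * d_k] = [words, d_model]
W_o = np.random.randn(h * d_k, d_model)
out = concat @ W_o                        # back to [words, d_model]
```

Because `out` has the same shape as `X`, the residual connection is a plain elementwise addition.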