
I am hand-coding a transformer (https://arxiv.org/pdf/1706.03762.pdf) based primarily on the instructions I found at this blog: http://jalammar.github.io/illustrated-transformer/.

The first attention block takes a matrix input of shape [words, input dimension] and multiplies it by the attention weight matrices of shape [input dimension, model dimension]. The model dimension is chosen to be smaller than the input dimension and is used as the output dimension in all subsequent steps.

There is a residual connection around the attention block, and the input is meant to be added to the output of the attention block. However, the output of the attention block has shape [words, model dimension] while the input has shape [words, input dimension]. Should I interpolate the input down to the model dimension, as is done in ResNet? Or should I add another weight matrix to transform the input?
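To make the mismatch concrete, here is a minimal NumPy sketch of the setup described above (all sizes are hypothetical, and only a single projection is shown):

```python
import numpy as np

words, d_input, d_model = 4, 512, 64   # hypothetical sizes

X = np.random.randn(words, d_input)     # input: [words, input dimension]
W = np.random.randn(d_input, d_model)   # attention weight matrix
out = X @ W                             # [words, model dimension]

# The residual add X + out would fail here, because X is (4, 512)
# while out is (4, 64) -- this is the shape conflict in question.
print(X.shape, out.shape)
```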


mkohler

1 Answer


The input dimensionality is the embedding size, which is the same as the model dimensionality, as explained in section 3.4 of the article:

3.4 Embeddings and Softmax

Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{model}$.

Therefore, the input dimensionality and the model dimensionality are the same, which makes them suitable for the residual connection.
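As a sketch (hypothetical sizes, a single attention head for brevity), scaled dot-product attention over inputs of dimension $d_{model}$ produces outputs of the same dimension, so the residual add is well-defined:

```python
import numpy as np

words, d_model = 4, 512   # input dim == model dim, per section 3.4

X = np.random.randn(words, d_model)
W_q = np.random.randn(d_model, d_model)
W_k = np.random.randn(d_model, d_model)
W_v = np.random.randn(d_model, d_model)

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_model)
scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attn_out = weights @ V                          # [words, d_model]

residual = X + attn_out   # shapes match: no interpolation or extra projection needed
```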

noe
  • Thank you for pointing this out. Indeed input dim = model dim. The concept I was missing was that the attention matrices can have a smaller dimension than the model dim. However, after the heads are concatenated, they are multiplied by a matrix to restore the original model dim. – mkohler Jan 26 '21 at 22:26
  • I suggest taking a look at this answer for details on that matter. – noe Jan 26 '21 at 23:34
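The point raised in the comments above can be sketched as follows: each head projects down to a smaller dimension $d_k = d_{model}/h$, the head outputs are concatenated, and a final matrix $W^O$ maps the result back to $d_{model}$ (hypothetical sizes, loop over heads for clarity):

```python
import numpy as np

words, d_model, h = 4, 512, 8
d_k = d_model // h            # per-head dimension (64), smaller than d_model

X = np.random.randn(words, d_model)

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

heads = []
for _ in range(h):
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads.append(softmax(Q @ K.T / np.sqrt(d_k)) @ V)   # [words, d_k]

concat = np.concatenate(heads, axis=-1)   # [words, h * d_k] = [words, d_model]
W_o = np.random.randn(h * d_k, d_model)
out = concat @ W_o                        # back to [words, d_model]
```

Because `out` has the same shape as `X`, the residual connection is a plain elementwise addition.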