
Assume we have a Transformer (from the "Attention Is All You Need" paper) and we feed it an input sequence $S$ of length $n_{words}$.

If no padding is applied, the output of the encoder would be a matrix in $\mathbb{R}^{n_{words} \times d_{model}}$, right? (See the Transformer structure figure in the paper.)
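To make the shapes concrete, here is a minimal sketch of the encoder side using PyTorch's built-in modules (the layer sizes and module choices are just my own illustration, not something from the paper):

```python
import torch
import torch.nn as nn

d_model, n_words = 512, 10  # 512 is the paper's d_model; n_words is an arbitrary sequence length

# a small stack of encoder layers (hyperparameters chosen only for illustration)
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

src = torch.randn(1, n_words, d_model)  # one unpadded, already-embedded input sequence S
memory = encoder(src)
print(memory.shape)  # torch.Size([1, 10, 512]) -> n_words x d_model
```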

But this is a problem, since the output of the decoder's Masked Multi-Head Attention is a matrix whose dimensions depend on the length of the Output (shifted right) sequence.

How is this solved?

a) are inputs padded?

b) does the Output (shifted right) have the same dimensions as the input?

c) am I missing something else?

If you could point me to some resources where this is explained, or to some pieces of code, that would be great!

Many thanks

alle

1 Answer


You do NOT need to pad inputs and outputs to have the same length.

The interaction between input and output occurs when you compute attention in the second sublayer of the decoder:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q \in \mathbb{R}^{n_{output} \times d_{model}}$, $K \in \mathbb{R}^{n_{input} \times d_{model}}$ and $V \in \mathbb{R}^{n_{input} \times d_{model}}$ ($n_{input}$ and $n_{output}$ are the source and target sequence lengths).

$K$ and $V$ come from the encoder (the inputs) and $Q$ comes from the decoder (the shifted outputs). $QK^T$ is an $n_{output} \times n_{input}$ matrix, and multiplying it by $V$ gives an $n_{output} \times d_{model}$ result, so the attention output depends only on the decoder sequence length and the two sequences never need to have the same length.
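If it helps to see this in code, here is a minimal sketch of that cross-attention step using PyTorch's `nn.MultiheadAttention` in isolation (the sizes are arbitrary, not the paper's hyperparameters):

```python
import torch
import torch.nn as nn

d_model, n_input, n_output = 512, 10, 7  # encoder and decoder sequence lengths differ

# second decoder sublayer: queries from the decoder, keys/values from the encoder output
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

decoder_states = torch.randn(1, n_output, d_model)  # Q comes from the (shifted) outputs
encoder_memory = torch.randn(1, n_input, d_model)   # K and V come from the inputs

out, _ = cross_attn(decoder_states, encoder_memory, encoder_memory)
print(out.shape)  # torch.Size([1, 7, 512]) -> n_output x d_model, independent of n_input
```

The output length tracks only the query (decoder) length, which is why inputs and outputs never need to be padded to a common length.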

waxalas
  • I agree that this way you get the right dimensions. But then you are extracting the values from the encoder output; shouldn't we use the decoder's output as the "information retrieval database"? Look at this answer. – alle Jan 16 '23 at 10:34