Assume we have a Transformer (from the "Attention Is All You Need" paper) and we feed it an input sequence S of length $n_{words}$.
If no padding is applied, the output of the encoder would be a matrix in $\mathbb{R}^{n_{words} \times d_{model}}$, right? See the figure below:
But this is a problem, since the output of the Masked Multi-Head Attention is a matrix whose dimensions depend on the length of the Output (shifted right), not on the input.
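To make the shapes concrete, here is a minimal sketch of what I mean (just my understanding, using PyTorch's `nn.MultiheadAttention`; the lengths `n_words = 10` and `n_out = 7` are made up, and I reuse a single attention module purely to print shapes):

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
n_words = 10  # length of the input sequence S (made-up value)
n_out = 7     # length of the Output (shifted right) (made-up value)

src = torch.randn(1, n_words, d_model)  # encoder input embeddings (batch = 1)
tgt = torch.randn(1, n_out, d_model)    # decoder input embeddings (batch = 1)

attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Encoder self-attention: queries, keys and values all come from the input,
# so the output keeps the input length.
enc_out, _ = attn(src, src, src)
print(enc_out.shape)  # torch.Size([1, 10, 512]) -> n_words x d_model

# Decoder masked self-attention: queries, keys and values all come from the
# Output (shifted right), so the output keeps that length instead.
causal_mask = torch.triu(torch.ones(n_out, n_out, dtype=torch.bool), diagonal=1)
dec_out, _ = attn(tgt, tgt, tgt, attn_mask=causal_mask)
print(dec_out.shape)  # torch.Size([1, 7, 512]) -> n_out x d_model
```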
How is this solved?
a) are inputs padded?
b) does the Output (shifted right) have the same dimensions as the input?
c) am I missing something else?
If you could point me to some resources where this is explained, or to some code, that would be great!
Many thanks