The paper *Attention Is All You Need* describes the transformer architecture, which consists of an encoder and a decoder.
However, it wasn't clear to me which cost function is minimized when training such an architecture.
Consider a translation task, for example, where, given an English sentence $x_{english} = [x_0, x_1, x_2, \dots, x_m]$, the transformer decodes it into a French sentence $x_{french}' = [x_0', x_1', \dots, x_n']$. Let's say the true label is $y_{french} = [y_0, y_1, \dots, y_p]$.
What is the objective function of the transformer? Is it the MSE between $x_{french}'$ and $y_{french}$? And does it include any weight regularization terms?
About the `out_logits` variable there, in the line that you're linking us to, i.e. `tf.reshape(out_logits, [-1, VOCAB_SIZE])`, the cross-entropy seems to be computed between two vectors and not a matrix and a vector (but I didn't really execute that code and print the shapes of those tensors/variables). It may be a good idea to do that, just to confirm the exact shapes. – nbro Dec 10 '20 at 10:21
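To make the shape question concrete, here is a minimal sketch of that kind of shape check. The names `out_logits` and `VOCAB_SIZE` mirror the linked code, but the batch size, sequence length, and vocabulary size are made up purely for illustration, and the loss call is just TensorFlow's standard sparse cross-entropy, not necessarily the exact line from the linked implementation.

```python
import tensorflow as tf

# Illustrative shapes only (assumptions, not taken from the linked code).
BATCH_SIZE, SEQ_LEN, VOCAB_SIZE = 2, 5, 100

# Stand-in for the decoder output: one logit vector over the vocabulary
# per target position.
out_logits = tf.random.normal([BATCH_SIZE, SEQ_LEN, VOCAB_SIZE])
# Stand-in for the ground-truth target token ids.
targets = tf.random.uniform([BATCH_SIZE, SEQ_LEN], maxval=VOCAB_SIZE, dtype=tf.int32)

# The reshape flattens batch and time into one axis.
flat_logits = tf.reshape(out_logits, [-1, VOCAB_SIZE])   # (BATCH_SIZE * SEQ_LEN, VOCAB_SIZE)
flat_targets = tf.reshape(targets, [-1])                  # (BATCH_SIZE * SEQ_LEN,)
print(flat_logits.shape, flat_targets.shape)              # (10, 100) (10,)

# Cross-entropy per position: each row of flat_logits is compared against
# one integer class id, then averaged into a scalar training loss.
loss = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=flat_targets, logits=flat_logits)
print(loss.shape)                                          # (10,)
print(tf.reduce_mean(loss))                                # scalar
```

Running this shows that, after the reshape, the logits form a matrix of shape `(BATCH_SIZE * SEQ_LEN, VOCAB_SIZE)` while the targets flatten to a vector, with one cross-entropy value per target position.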