I'm having difficulty understanding a few aspects of the Seq2Seq transformer tutorial (https://pytorch.org/tutorials/beginner/transformer_tutorial.html):
- The tutorial says it implements the architecture from "Attention Is All You Need", but I don't see a TransformerDecoder used anywhere; only a TransformerEncoder seems to be used (see the first sketch below). How does this example work without the decoder?
- The tutorial says it uses a softmax to output probabilities over the dictionary, but I only see a linear output layer, with no softmax after it. Where is the softmax applied?
- Is this model learning to predict one word ahead (e.g. [hi how are you] -> [how are you doing])? I can't find the actual task described anywhere; the tutorial only specifies the inputs and targets in terms of an alphabet (see the second sketch below).
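For reference, here is a minimal sketch of the model as I read it from the tutorial's code. The class name `EncoderOnlyLM` and the argument names are my own paraphrase, not the tutorial's exact code, and I've left out the `PositionalEncoding` module for brevity:

```python
import math
import torch.nn as nn

class EncoderOnlyLM(nn.Module):
    """My reading of the tutorial's model: an encoder stack plus a linear
    head, with no nn.TransformerDecoder and no explicit softmax."""
    def __init__(self, ntoken, d_model, nhead, d_hid, nlayers, dropout=0.5):
        super().__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(ntoken, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layer, nlayers)
        self.linear = nn.Linear(d_model, ntoken)

    def forward(self, src, src_mask):
        # (the tutorial also adds positional encodings here; omitted for brevity)
        x = self.embedding(src) * math.sqrt(self.d_model)
        x = self.transformer_encoder(x, src_mask)
        return self.linear(x)  # raw logits over the vocabulary, no softmax in sight
```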
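And for the third question, this is how I read the tutorial's `get_batch`: the target appears to be the input shifted one token to the right, which would make the task next-token prediction. The comments show a hypothetical word-level example; the tutorial itself works on tensors of token ids:

```python
def get_batch(source, i, bptt=35):
    # source: a long tensor of token ids (shape [seq_len] or [seq_len, batch])
    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i : i + seq_len]            # e.g. [hi, how, are, you]
    target = source[i + 1 : i + 1 + seq_len]  # e.g. [how, are, you, doing]
    return data, target.reshape(-1)
```

Is that the right way to understand the task?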
Appreciate any help.
cc @pytorch/team-text-core @Nayef211