Description
Paper
Link: https://arxiv.org/abs/1912.10077
Year: 2020
Summary
- multi-head self-attention layers can compute contextual mappings of the input sequences, i.e., map each token to a value that depends on the entire sequence, not just the token itself (see the sketch below)
- Main result: Transformers are universal approximators of continuous, permutation-equivariant sequence-to-sequence functions with compact support; in this sense they can represent arbitrary sequence-to-sequence functions of that class to any desired accuracy
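A minimal NumPy sketch of the two properties mentioned above: self-attention without positional encodings is permutation equivariant, and its output at each position depends on the whole sequence (a loose illustration of the "contextual mapping" idea, not the paper's formal construction). The weights are random and the function name `self_attention` is just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 8, 2            # sequence length, model dimension, number of heads
d_head = d // h

# Random projection matrices for queries, keys, values, and the output.
Wq, Wk, Wv = (rng.standard_normal((h, d, d_head)) for _ in range(3))
Wo = rng.standard_normal((h * d_head, d))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Multi-head self-attention on X of shape (n, d), no positional encoding."""
    heads = []
    for i in range(h):
        Q, K, V = X @ Wq[i], X @ Wk[i], X @ Wv[i]
        A = softmax(Q @ K.T / np.sqrt(d_head))   # (n, n) attention weights
        heads.append(A @ V)                      # each output row mixes all positions
    return np.concatenate(heads, axis=-1) @ Wo

X = rng.standard_normal((n, d))

# 1) Permutation equivariance: permuting input rows permutes output rows the same way.
P = rng.permutation(n)
assert np.allclose(self_attention(X)[P], self_attention(X[P]), atol=1e-8)

# 2) Context dependence: perturbing a *different* token changes the output at position 0.
X2 = X.copy()
X2[3] += 1.0
delta = np.abs(self_attention(X)[0] - self_attention(X2)[0]).max()
print("output at position 0 changed by", delta)  # strictly > 0
```

Because the layer is permutation equivariant, the universal approximation result is stated for permutation-equivariant target functions; the paper handles arbitrary continuous sequence-to-sequence functions by adding positional encodings, which break this symmetry.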