Description
Paper
Link: https://arxiv.org/abs/1901.02860
Year: 2019
Summary
- Transformer-XL enables learning dependency beyond a fixed length without disrupting temporal coherence
- resolves the context fragmentation problem
Contributions and Distinctions from Previous Works
- Transformers have the potential to learn longer-term dependency, but are limited by a fixed-length context in the setting of language modeling.
Methods
- the main technical contributions are introducing the notion of recurrence into a purely self-attentive model (segment-level recurrence) and deriving a novel relative positional encoding scheme, as sketched below
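A minimal single-head sketch of these two ideas, assuming PyTorch; the names (`rel_attention_with_memory`, `w_q`, `w_k`, `w_v`, `w_r`, `r_emb`, `u`, `v`) are illustrative and not taken from the paper's released code. It follows the paper's split of the attention score into content and position terms, but omits batching, multiple heads, and the efficient relative-shift trick.

```python
import torch
import torch.nn.functional as F

def rel_attention_with_memory(h_curr, mem, w_q, w_k, w_v, w_r, r_emb, u, v):
    """Single-head self-attention with segment-level recurrence and relative
    positional encoding (batch dimension omitted for readability).

    h_curr: (L, d)    hidden states of the current segment
    mem:    (M, d)    cached hidden states from the previous segment(s)
    r_emb:  (M+L, d)  relative position embeddings, index = distance 0..M+L-1
    u, v:   (d,)      learned global content / position biases
    w_*:    (d, d)    projection matrices
    """
    L, d = h_curr.shape
    M = mem.shape[0]

    # Recurrence: reuse the cached states as extra context, but stop gradients
    # from flowing back into the previous segment.
    h_ext = torch.cat([mem.detach(), h_curr], dim=0)      # (M+L, d)

    q = h_curr @ w_q                                      # queries from the current segment only
    k = h_ext @ w_k                                       # keys/values over memory + current segment
    val = h_ext @ w_v

    # Relative distance between each query (offset by M in the extended context)
    # and each key; future keys are removed by the causal mask below.
    q_pos = torch.arange(L).unsqueeze(1) + M              # (L, 1)
    k_pos = torch.arange(M + L).unsqueeze(0)              # (1, M+L)
    dist = (q_pos - k_pos).clamp(min=0)                   # (L, M+L)
    r = r_emb[dist] @ w_r                                 # (L, M+L, d)

    # Attention score = content terms (a)+(c) plus position terms (b)+(d).
    content = (q + u) @ k.T                               # (L, M+L)
    position = torch.einsum('ld,lmd->lm', q + v, r)       # (L, M+L)
    scores = (content + position) / d ** 0.5

    causal = q_pos < k_pos                                # True where the key lies in the future
    scores = scores.masked_fill(causal, float('-inf'))

    attn = F.softmax(scores, dim=-1)
    return attn @ val                                     # (L, d)
```

In a full model, `h_curr` (detached) would then be cached as the memory for the next segment, so the effective context grows linearly with the number of layers and segments.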
Results
- Transformer-XL learns dependency that is 80% longer than RNNs and 450% longer than vanilla Transformers
- achieves better performance on both short and long sequences
- in experiments on enwiki8, Transformer-XL is up to 1,800+ times faster than the vanilla Transformer during evaluation (see the sketch after this list)
- able to generate relatively coherent long text articles with thousands of tokens
- first self-attention model that achieves substantially better results than RNNs on both character-level and word-level language modeling
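The evaluation speedup reported above comes from reusing cached segment representations instead of recomputing a full fixed-length context for every predicted token, as the vanilla model must. A rough sketch of the contrast, assuming a hypothetical model interface (`model.loss` and `model.loss_with_memory` are placeholders, not an actual API):

```python
def vanilla_eval(model, tokens, ctx_len):
    """Vanilla Transformer: slide the window by one token and recompute the
    whole fixed-length context for every prediction."""
    losses = []
    for i in range(ctx_len, len(tokens)):
        window = tokens[i - ctx_len:i]            # full context recomputed each step
        losses.append(model.loss(window, target=tokens[i]))
    return sum(losses) / len(losses)

def transformer_xl_eval(model, tokens, seg_len):
    """Transformer-XL: process the stream segment by segment, reusing cached
    hidden states (memory) from earlier segments instead of recomputing them."""
    mems, losses = None, []
    for s in range(0, len(tokens) - 1, seg_len):
        segment = tokens[s:s + seg_len]
        targets = tokens[s + 1:s + seg_len + 1]
        loss, mems = model.loss_with_memory(segment, targets, mems)
        losses.append(loss)
    return sum(losses) / len(losses)
```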