Description
Paper
Link: https://arxiv.org/abs/2001.04451
Year: 2020
Summary
- more memory-efficient and faster than the standard Transformer on long sequences, with comparable modeling quality
Contributions and Distinctions from Previous Works
- training large Transformer models can be prohibitively costly, especially on long sequences
- introduce two techniques to improve the efficiency of Transformers
- replace dot-product attention by one that uses locality-sensitive hashing
- use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers (a minimal sketch follows this list)
- the resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences
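A minimal sketch of the reversible residual idea, assuming toy numpy sublayers F and G in place of the real attention and feed-forward sublayers (the dimensions and names here are illustrative, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sublayers standing in for attention (F) and feed-forward (G).
W_f = rng.standard_normal((64, 64)) * 0.02
W_g = rng.standard_normal((64, 64)) * 0.02
F = lambda x: np.tanh(x @ W_f)
G = lambda x: np.tanh(x @ W_g)

def rev_forward(x1, x2):
    # Forward pass of one reversible block: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_inverse(y1, y2):
    # The inputs can be recomputed from the outputs, so per-layer activations
    # need not be kept in memory for backpropagation.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1, x2 = rng.standard_normal((16, 64)), rng.standard_normal((16, 64))
y1, y2 = rev_forward(x1, x2)
r1, r2 = rev_inverse(y1, y2)
assert np.allclose(r1, x1) and np.allclose(r2, x2)
```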
Methods
- Reversible layers, first introduced in Gomez et al. (2017), enable storing only a single copy of activations in the whole model, so the N factor disappears.
- Splitting activations inside feed-forward layers and processing them in chunks removes the d_ff factor and saves memory inside feed-forward layers.
- Approximate attention computation based on locality-sensitive hashing replaces the O(L^2) factor in attention layers with O(L log L) and so allows operating on long sequences (the hashing and chunking steps are sketched below).
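A minimal sketch of the LSH bucketing step, assuming numpy and toy sizes; `lsh_buckets` and its arguments are illustrative names, and the full Reformer additionally sorts positions by bucket, chunks them, and uses multiple hash rounds:

```python
import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(q, n_buckets, rng):
    # Random-rotation LSH: project onto a random matrix R of shape
    # (d, n_buckets/2) and take argmax over [qR, -qR]; vectors with high
    # cosine similarity tend to land in the same bucket.
    d = q.shape[-1]
    R = rng.standard_normal((d, n_buckets // 2))
    proj = q @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

# Toy example: 1024 positions with 64-dimensional queries, 16 buckets.
q = rng.standard_normal((1024, 64))
buckets = lsh_buckets(q, n_buckets=16, rng=rng)

# Each query then attends only to positions in the same bucket (after sorting
# and chunking), instead of to all L positions.
print("bucket of position 0:", buckets[0],
      "| positions sharing it:", int((buckets == buckets[0]).sum()))
```

And a sketch of chunked feed-forward processing, again with hypothetical toy weights; because the position-wise FFN acts on each position independently, processing the sequence in chunks gives the same result while keeping only a small (chunk, d_ff) intermediate in memory at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 64, 256, 1024
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
W2 = rng.standard_normal((d_ff, d_model)) * 0.02

def feed_forward(x):
    # Position-wise FFN; the (len(x), d_ff) intermediate dominates memory.
    return np.maximum(x @ W1, 0.0) @ W2

def chunked_feed_forward(x, n_chunks):
    # Split along the sequence axis and process one chunk at a time.
    return np.concatenate(
        [feed_forward(c) for c in np.array_split(x, n_chunks, axis=0)], axis=0)

x = rng.standard_normal((seq_len, d_model))
assert np.allclose(feed_forward(x), chunked_feed_forward(x, n_chunks=8))
```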
Results
- Reformer combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences and with small memory use even for models with a large number of layers