
Reformer: The Efficient Transformer #41

@jinglescode

Description


Paper

Link: https://arxiv.org/abs/2001.04451
Year: 2020

Summary

  • the Reformer is more memory-efficient and faster than the standard Transformer on long sequences

Contributions and Distinctions from Previous Works

  • training large Transformer models can be prohibitively costly, especially on long sequences
  • the paper introduces two techniques to improve the efficiency of Transformers:
    • replace dot-product attention with attention based on locality-sensitive hashing (LSH)
    • use reversible residual layers instead of the standard residuals, which allows storing activations only once during training instead of N times, where N is the number of layers
  • the resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences

Methods

  • Reversible layers, first introduced in Gomez et al. (2017), enable storing only a single copy of activations in the whole model, so the N factor disappears (see the first sketch after this list).
  • Splitting activations inside feed-forward layers and processing them in chunks removes the d_ff factor and saves memory inside feed-forward layers (second sketch below).
  • Approximate attention computation based on locality-sensitive hashing replaces the O(L^2) factor in attention layers with O(L log L) and so allows operating on long sequences (third sketch below).
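
A minimal NumPy sketch of the reversible residual idea (not the paper's Trax code; `F` and `G` stand in for the attention and feed-forward sublayers, and shapes are illustrative). Because the block's inputs can be reconstructed exactly from its outputs, per-layer activations do not need to be stored for backpropagation:

```python
# Reversible residual block, RevNet-style as used in the Reformer.
# Forward:  y1 = x1 + F(x2);  y2 = x2 + G(y1)
# Inverse:  x2 = y2 - G(y1);  x1 = y1 - F(x2)
import numpy as np

def rev_block_forward(x1, x2, F, G):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def rev_block_inverse(y1, y2, F, G):
    # Recover the inputs by running the residuals backwards.
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 8
    # Toy sublayers: any deterministic functions work for this identity check.
    W_f, W_g = rng.normal(size=(d, d)), rng.normal(size=(d, d))
    F = lambda x: np.tanh(x @ W_f)
    G = lambda x: np.tanh(x @ W_g)

    x1, x2 = rng.normal(size=(4, d)), rng.normal(size=(4, d))
    y1, y2 = rev_block_forward(x1, x2, F, G)
    r1, r2 = rev_block_inverse(y1, y2, F, G)
    assert np.allclose(x1, r1) and np.allclose(x2, r2)
    print("inputs reconstructed from outputs; activations need not be stored")
```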
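
A sketch of chunked feed-forward processing, again in NumPy with illustrative names and shapes. Because the feed-forward sublayer acts on each position independently, the sequence can be processed chunk by chunk, so only one chunk's hidden activations of width d_ff are alive at a time:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise FFN: relu(x W1 + b1) W2 + b2
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

def chunked_feed_forward(x, W1, b1, W2, b2, chunk_size=64):
    # Apply the same FFN chunk by chunk along the sequence axis.
    outputs = []
    for start in range(0, x.shape[0], chunk_size):
        outputs.append(feed_forward(x[start:start + chunk_size], W1, b1, W2, b2))
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, d_model, d_ff = 256, 16, 64
    x = rng.normal(size=(L, d_model))
    W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
    W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
    # Chunked and unchunked results agree; only peak memory differs.
    assert np.allclose(feed_forward(x, W1, b1, W2, b2),
                       chunked_feed_forward(x, W1, b1, W2, b2, chunk_size=32))
```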
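
A simplified sketch of LSH bucketing and bucket-restricted attention, in the spirit of the paper's angular LSH (hash(x) = argmax of [xR; -xR] for a random projection R). The real Reformer additionally sorts by bucket, attends over fixed-size chunks, masks out attention to self, and uses multiple hash rounds; those details are omitted here, so this only illustrates why similar vectors end up attending to each other:

```python
import numpy as np

def lsh_buckets(x, n_buckets, rng):
    # Angular LSH: project onto n_buckets//2 random directions and take the
    # argmax over [xR ; -xR]; nearby vectors tend to land in the same bucket.
    r = rng.normal(size=(x.shape[-1], n_buckets // 2))
    rotated = x @ r
    return np.argmax(np.concatenate([rotated, -rotated], axis=-1), axis=-1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lsh_attention(qk, v, n_buckets, rng):
    # Bucket-restricted attention with shared queries/keys (shared-QK),
    # so each position attends only within its own hash bucket.
    L, d = qk.shape
    buckets = lsh_buckets(qk, n_buckets, rng)
    out = np.zeros_like(v)
    for b in np.unique(buckets):
        idx = np.where(buckets == b)[0]
        scores = qk[idx] @ qk[idx].T / np.sqrt(d)
        out[idx] = softmax(scores) @ v[idx]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, d = 512, 32
    qk = rng.normal(size=(L, d))
    v = rng.normal(size=(L, d))
    y = lsh_attention(qk, v, n_buckets=16, rng=rng)
    print(y.shape)  # (512, 32): each row attends only within its bucket
```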

Results

  • the Reformer combines the modeling capacity of a Transformer with an architecture that can be executed efficiently on long sequences and with small memory use, even for models with a large number of layers
