
Compressive transformers for long-range sequence modelling #40

Description

Paper

Link: https://arxiv.org/abs/1911.05507
Year: 2019

Summary

  • introduces a compressive memory mechanism that compresses past memories, rather than discarding them, to support long-range sequence learning

Contributions and Distinctions from Previous Works

  • focuses on extending the effective (temporal) memory of the model while keeping the memory and attention cost of storing that history low

Methods

  • like the vanilla Transformer, uses multi-head attention to propagate information over time
  • like TransformerXL, maintains a memory of past activations at each layer to preserve a longer history of context; the Compressive Transformer then compresses these old
    memories, instead of discarding them, and stores them in an additional compressed memory (see the sketch after this list)
  • learning rate schedule with a linear warmup from 1e-6 to 3e-4 and a cosine decay back down to 1e-6 (see the second sketch after this list)
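
Below is a minimal sketch of the per-layer memory update, not the authors' code: it assumes mean pooling with compression rate `c` as the compression function (one of the simpler variants evaluated in the paper) and simply drops any evicted activations left over after the last full pooling window.

```python
import numpy as np

class CompressiveMemory:
    """Per-layer FIFO memory plus compressed memory (illustrative sketch)."""

    def __init__(self, n_mem, n_cmem, c, d_model):
        self.n_mem, self.n_cmem, self.c = n_mem, n_cmem, c
        self.mem = np.zeros((0, d_model))    # recent activations (FIFO)
        self.cmem = np.zeros((0, d_model))   # compressed activations (FIFO)

    def update(self, new_acts):
        """new_acts: (n_s, d_model) hidden states of the current segment."""
        self.mem = np.concatenate([self.mem, new_acts], axis=0)
        overflow = self.mem.shape[0] - self.n_mem
        if overflow > 0:
            old, self.mem = self.mem[:overflow], self.mem[overflow:]
            # Compress the evicted activations instead of discarding them:
            # mean-pool every c consecutive vectors into one compressed slot.
            n_full = (old.shape[0] // self.c) * self.c
            if n_full > 0:
                pooled = old[:n_full].reshape(-1, self.c, old.shape[1]).mean(axis=1)
                self.cmem = np.concatenate([self.cmem, pooled], axis=0)[-self.n_cmem:]

    def context(self):
        """Memories the current segment attends over: [compressed, recent]."""
        return np.concatenate([self.cmem, self.mem], axis=0)
```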

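The learning-rate schedule above can be written down directly. The endpoint values (1e-6, 3e-4, back down to 1e-6) are from the summary; the warmup length and total step count below are illustrative placeholders, not values from the paper.

```python
import math

def learning_rate(step, warmup_steps=4_000, total_steps=500_000,
                  lr_min=1e-6, lr_max=3e-4):
    """Linear warmup from lr_min to lr_max, then cosine decay back to lr_min.

    warmup_steps and total_steps are hypothetical placeholders.
    """
    if step < warmup_steps:
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```
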
Results

  • obtains a maximum temporal range that is two times greater than TransformerXL's with an identical attention cost (see the note after this list)
  • obtains state-of-the-art language modelling results on the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively
  • models the waveform of high-frequency speech, outperforming TransformerXL and maintaining a slim advantage over WaveNet
  • can be used as a memory component within an RL agent
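
The factor-of-two range claim follows from the paper's accounting of temporal range versus attention cost. Symbols follow the paper (l layers, n_s segment length, n_m memory size, n_cm compressed-memory size, c compression rate); the worked numbers below are just one illustrative configuration.

```latex
% Maximum temporal range, per the paper:
R_{\text{TXL}} = l \, n_m^{\text{XL}}, \qquad
R_{\text{Compressive}} = l \, (n_m + c \, n_{cm})
% Attention cost scales with the number of attended slots: n_m^{XL} for
% TransformerXL, n_m + n_{cm} for the Compressive Transformer. Matching costs
% with n_m^{XL} = N and n_m = n_{cm} = N/2, c = 3 gives l(N/2 + 3N/2) = 2lN
% versus lN, i.e. twice the range at the same attention cost.
```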
