
Big Bird: Transformers for Longer Sequences #51

Description

Paper

Link: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf
Year: 2020

Summary

  • scales transformers to long sequences by replacing the full quadratic attention mechanism with a combination of random attention, window attention, and global attention
  • this allows processing of much longer sequences, translating to state-of-the-art experimental results (see the usage sketch after this list)
  • comes with theoretical guarantees of universal approximation and Turing completeness
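For reference, a minimal usage sketch with the BigBird port in Hugging Face Transformers, assuming the google/bigbird-roberta-base checkpoint; the block_size / num_random_blocks values shown are illustrative assumptions, not the paper's tuned settings:

```python
# Minimal sketch: running a pretrained BigBird encoder on a long input.
# Assumes the Hugging Face `transformers` BigBird port (needs sentencepiece);
# block_size / num_random_blocks below are illustrative, not tuned values.
from transformers import BigBirdModel, BigBirdTokenizer

tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-base")
model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-base",
    attention_type="block_sparse",   # sparse BigBird attention; "original_full" is quadratic
    block_size=64,                   # tokens per attention block (assumed value)
    num_random_blocks=3,             # random blocks per query block (assumed value)
)

inputs = tokenizer(
    "some long document " * 200,
    return_tensors="pt",
    truncation=True,
    max_length=2048,                 # well beyond BERT's usual 512-token limit
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```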

Methods

[Figure: the BigBird attention pattern, combining random, sliding-window, and global attention]

  • instead of full attention, they combine three types of attention (roughly Longformer plus random attention); see the mask sketch after this list
    • random attention: each query attends to a small set of randomly chosen keys (like a random walk over a sparse graph) - saves computation while still gathering enough information
    • window attention: each token i attends to itself and its neighbors (i-1 and i+1), so information can hop from neighbor to neighbor across layers (like Longformer, like a convolution) - local neighborhood information is important
    • global attention: a few tokens (like the "CLS" token) attend to every token and every token attends to them, so any two positions can communicate through a global node - acts as a global information hub
  • needs multiple layers to work, plus a lot of engineering tricks
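As a concrete illustration of how the three patterns combine, here is a minimal NumPy sketch of a per-token BigBird-style attention mask. The paper's actual implementation operates on blocks of tokens for hardware efficiency, and window_size / num_random / num_global here are illustrative parameters, not the paper's settings:

```python
# Sketch: combine window, random, and global attention into one boolean mask
# over an n x n attention matrix (True = query i may attend to key j).
# Per-token for clarity; the paper's implementation is block-sparse.
import numpy as np

def bigbird_style_mask(n, window_size=3, num_random=2, num_global=2, seed=0):
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)

    # 1) Window (local) attention: each token attends to itself and its neighbors.
    for i in range(n):
        lo = max(0, i - window_size // 2)
        hi = min(n, i + window_size // 2 + 1)
        mask[i, lo:hi] = True

    # 2) Random attention: each token additionally attends to a few random keys.
    for i in range(n):
        mask[i, rng.choice(n, size=num_random, replace=False)] = True

    # 3) Global attention: the first num_global tokens (CLS-like) attend to
    #    everything, and everything attends to them.
    mask[:num_global, :] = True
    mask[:, :num_global] = True
    return mask

mask = bigbird_style_mask(16)
print(mask.sum(), "allowed pairs out of", 16 * 16)  # far fewer than n^2 as n grows
```

The number of allowed pairs grows roughly linearly in n (window + random + global contributions), which is where the claimed linear-cost attention comes from.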

Results

  • performs well enough overall, though results are mixed across tasks

Comments

  • quite a lot of tricks here
