Rethinking Attention with Performers #45

@jinglescode

Paper

Link: https://arxiv.org/abs/2009.14794
Year: 2020

Summary

  • proposes Fast Attention Via positive Orthogonal Random features (FAVOR+), a set of techniques for approximating softmax self-attention in Transformers with better space and time complexity when the sequence length is much larger than the feature dimension (see the sketch under Methods)
  • Performers are provably and practically accurate estimators of regular, full-rank attention without relying on priors such as sparsity or low-rankness; the same mechanism can also efficiently model kernelizable attention mechanisms beyond softmax, and this stronger representation power yields better empirical results than regular Transformers on some datasets
  • evaluated on a rich set of tasks including pixel prediction, language modeling, and protein sequence modeling, with results competitive with the other efficient sparse and dense attention models examined

Contributions and Distinctions from Previous Works

  • this work aims to reduce the space and time complexity of Transformer attention; unlike prior efficient-attention approaches, the approximation does not depend on sparsity or low-rank assumptions

Methods

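A minimal sketch of the FAVOR+ idea, assuming the single-exponential positive random feature estimator of the softmax kernel; the orthogonalisation of the random projection rows described in the paper is omitted for brevity, and all names (e.g. `favor_plus_attention`, `num_features`) are illustrative rather than taken from an official implementation:

```python
import numpy as np

def favor_plus_attention(Q, K, V, num_features=256, seed=0):
    """Approximate softmax attention with positive random features (FAVOR+-style sketch).

    Q, K: (L, d) queries and keys; V: (L, d_v) values.
    Uses the positive feature map phi(x) = exp(w^T x - ||x||^2 / 2) / sqrt(m),
    w ~ N(0, I_d), an unbiased estimator of the softmax kernel exp(q^T k).
    """
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    # Random projection; the paper additionally orthogonalises these rows
    # (block-wise QR) to reduce estimator variance -- omitted in this sketch.
    W = rng.standard_normal((num_features, d))

    # Scale queries/keys so the approximated kernel matches the usual
    # softmax logits q^T k / sqrt(d).
    Qs = Q / d ** 0.25
    Ks = K / d ** 0.25

    def phi(X):
        projection = X @ W.T                                    # (L, m)
        sq_norm = 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)  # (L, 1)
        return np.exp(projection - sq_norm) / np.sqrt(num_features)

    Qp, Kp = phi(Qs), phi(Ks)                                   # (L, m)
    # Associativity trick: compute K'^T V and K'^T 1 first, so the cost is
    # O(L * m * d) time and O(m * d) extra memory -- the L x L attention
    # matrix is never materialised.
    KV = Kp.T @ V                                               # (m, d_v)
    normalizer = Qp @ Kp.sum(axis=0)[:, None]                   # (L, 1)
    return (Qp @ KV) / (normalizer + 1e-6)


# Toy usage: the approximation should be close to exact softmax attention.
if __name__ == "__main__":
    L, d = 128, 16
    rng = np.random.default_rng(1)
    Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
    approx = favor_plus_attention(Q, K, V, num_features=1024)

    logits = Q @ K.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    exact = (weights / weights.sum(axis=-1, keepdims=True)) @ V
    print("mean abs error:", np.abs(approx - exact).mean())
```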

Results


Comments
