
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale #50


Paper

Link: https://arxiv.org/pdf/2010.11929.pdf
Year: 2020

Summary

  • global attention over the whole image, computed between patches
  • learns to attend to patches far away even in the lower layers, which a convnet cannot do because its early layers only see local neighborhoods (see the sketch below)
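
A minimal sketch of the "global attention" point above, assuming PyTorch (the shapes and module choices here are mine, not the authors' code): even in the very first self-attention layer, every patch token can attend to every other patch, whereas a conv layer's first-layer receptive field is only its kernel neighborhood.

```python
import torch
import torch.nn as nn

# One image represented as a sequence of patch embeddings,
# e.g. a 224x224 image split into 16x16 patches -> 196 tokens.
num_patches, dim = 196, 64
tokens = torch.randn(1, num_patches, dim)

attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
out, weights = attn(tokens, tokens, tokens)

# The attention matrix spans all patch pairs, so even layer 1 can relate
# patches that are far apart in the image.
print(weights.shape)              # torch.Size([1, 196, 196])
print((weights[0, 0] > 0).sum())  # every patch gets non-zero weight from patch 0
```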

Contributions and Distinctions from Previous Works

  • just as transformers have replaced LSTMs in NLP tasks, this work attempts to replace convolutions with attention
    • convnets have a good inductive prior (inductive bias): they extract features from immediate neighbors, which is sensible for image data

Methods

  • split the image into fixed-size patches, flatten each patch into a vector, and linearly project it to a token embedding; add position embeddings and a learnable [class] token, then feed the resulting sequence to a standard Transformer encoder (see the sketch below)
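
A minimal sketch of the patchify-and-embed step above, assuming PyTorch and ViT-Base-like shapes (16x16 patches, width 768); the variable names are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)  # (batch, channels, H, W)
patch, dim = 16, 768

# Cut the image into non-overlapping 16x16 patches and flatten each
# patch into a 3*16*16 = 768-dimensional vector: (1, 196, 768).
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)

# Linear projection to the model width, plus a learnable [class] token
# and learnable position embeddings (zero-initialized here for brevity).
proj = nn.Linear(3 * patch * patch, dim)
cls_token = nn.Parameter(torch.zeros(1, 1, dim))
pos_embed = nn.Parameter(torch.zeros(1, 196 + 1, dim))

tokens = torch.cat([cls_token.expand(1, -1, -1), proj(patches)], dim=1) + pos_embed
print(tokens.shape)  # torch.Size([1, 197, 768]), the sequence fed to the Transformer encoder
```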


Results

  • requires less compute to train than huge convnets
  • learns filters similar to those of traditional convnets
