On the relationship between self-attention and convolutional layers #22

@jinglescode

Description

Paper

Link: https://arxiv.org/pdf/1911.03584.pdf
Year: 2020

Summary

  • Attention layers can perform convolution: in practice, they learn to behave similarly to convolutional layers.
  • A multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer.

This theorem holds for both 1D and 2D convolutional layers.

  • The similarity between convolution and multi-head self-attention becomes striking when the query pixel is slid over the image.
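The constructive direction of the theorem can be illustrated with a small NumPy sketch (my own simplification, not the paper's code): give each attention head a one-hot attention pattern that always looks at a fixed relative shift, and use that head's value projection as one tap of the kernel. The paper obtains such one-hot patterns via quadratic relative positional encodings; here they are hard-coded. Function names and shapes are illustrative assumptions.

```python
import numpy as np

def conv1d(x, kernel):
    """Reference 1D convolution. x: (T, C_in), kernel: (K, C_in, C_out).
    Zero padding, stride 1."""
    T, C_in = x.shape
    K, _, C_out = kernel.shape
    pad = K // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, C_out))
    for t in range(T):
        for k in range(K):
            out[t] += xp[t + k] @ kernel[k]
    return out

def attention_as_conv1d(x, kernel):
    """Emulate conv1d with K attention heads: head k attends with
    probability 1 to the pixel at relative shift k - K//2, and its value
    projection is kernel[k]. Concatenation + output projection collapses
    to a sum of heads in this construction."""
    T, C_in = x.shape
    K, _, C_out = kernel.shape
    pad = K // 2
    out = np.zeros((T, C_out))
    for k in range(K):
        shift = k - pad
        A = np.zeros((T, T))  # one-hot attention matrix for head k
        for q in range(T):
            j = q + shift
            if 0 <= j < T:
                A[q, j] = 1.0  # attend only to pixel q + shift
        # rows falling outside the image stay all-zero, matching zero padding
        out += A @ x @ kernel[k]
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 3))    # 10 "pixels", 3 input channels
w = rng.standard_normal((3, 3, 4))  # kernel size 3, 3 -> 4 channels
print(np.allclose(conv1d(x, w), attention_as_conv1d(x, w)))  # True
```

The same construction extends to 2D by indexing heads with 2D shifts, which is how the paper covers K×K kernels with K² heads.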

Conclusion

We showed that self-attention layers applied to images can express any convolutional layer (given sufficiently many heads) and that fully-attentional models learn to combine local behavior (similar to convolution) and global attention based on input content. More generally, fully-attentional models seem to learn a generalization of CNNs where the kernel pattern is learned at the same time as the filters—similar to deformable convolutions (Dai et al., 2017; Zampieri, 2019). Interesting directions for future work include translating existing insights from the rich CNNs literature back to transformers on various data modalities, including images, text and time series.
