Description
Paper
Link: https://arxiv.org/pdf/1911.03584.pdf
Year: 2020
Summary
- self-attention layers can perform convolution; in practice they learn to behave similarly to convolutional layers
- a multi-head self-attention layer with a sufficient number of heads is at least as expressive as any convolutional layer; the theorem holds for both 1D and 2D convolutional layers
- the similarity between convolution and multi-head self-attention is striking when the query pixel is slid over the image: each head attends to a fixed shift relative to the query, like one kernel position
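The construction behind the theorem can be checked numerically: give the attention layer one head per kernel position (K*K heads for a KxK kernel), make each head's attention probabilities one-hot on the pixel at that head's relative shift, and use the corresponding kernel tap as that head's value/output projection. The sketch below is illustrative, not the paper's code; variable names and the hard-coded one-hot attention matrices are my assumptions (the paper proves such one-hot probabilities are reachable with relative positional encodings).

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 6            # image height / width
C_in, C_out = 3, 4   # input / output channels
K = 3                # kernel size -> K*K attention heads
shifts = [(dy, dx) for dy in range(-(K // 2), K // 2 + 1)
                   for dx in range(-(K // 2), K // 2 + 1)]

N = H * W
X = rng.normal(size=(N, C_in))              # flattened image, one row per pixel
W_val = rng.normal(size=(K * K, C_in, C_out))  # per-head projection = kernel taps

def idx(i, j):
    return i * W + j

# Reference: a plain KxK convolution with zero padding, computed directly.
ref = np.zeros((N, C_out))
for i in range(H):
    for j in range(W):
        for h, (dy, dx) in enumerate(shifts):
            y, x = i + dy, j + dx
            if 0 <= y < H and 0 <= x < W:
                ref[idx(i, j)] += X[idx(y, x)] @ W_val[h]

# Attention view: head h uses a one-hot attention matrix A that sends each
# query pixel to the pixel at relative shift (dy, dx), then applies its
# value projection -- the standard probs @ values @ projection pattern.
out = np.zeros((N, C_out))
for h, (dy, dx) in enumerate(shifts):
    A = np.zeros((N, N))
    for i in range(H):
        for j in range(W):
            y, x = i + dy, j + dx
            if 0 <= y < H and 0 <= x < W:
                A[idx(i, j), idx(y, x)] = 1.0
    out += A @ X @ W_val[h]

assert np.allclose(out, ref)
print("K*K attention heads reproduce the KxK convolution exactly")
```

With the one-hot attention fixed, each head contributes exactly one kernel tap, so summing the heads recovers the convolution output pixel for pixel; a learned attention layer is free to deviate from these one-hot patterns, which is the "generalization of CNNs" the conclusion refers to.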
Conclusion
We showed that self-attention layers applied to images can express any convolutional layer (given sufficiently many heads) and that fully-attentional models learn to combine local behavior (similar to convolution) and global attention based on input content. More generally, fully-attentional models seem to learn a generalization of CNNs where the kernel pattern is learned at the same time as the filters—similar to deformable convolutions (Dai et al., 2017; Zampieri, 2019). Interesting directions for future work include translating existing insights from the rich CNNs literature back to transformers on various data modalities, including images, text and time series.