
Pay Less Attention with Lightweight and Dynamic Convolutions #28

@jinglescode

Description

Paper

Link: https://arxiv.org/pdf/1901.10430.pdf
Year: 2019

Summary

  • introduces dynamic convolutions, which are simpler and more efficient than self-attention
  • a very lightweight convolution can perform competitively with the best reported self-attention results (see the sketch after this list)
  • the number of operations required by this approach scales linearly in the input length, whereas self-attention scales quadratically
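
For intuition, here is a minimal PyTorch sketch of a lightweight convolution: a depthwise 1D convolution whose kernels are softmax-normalized over the kernel width and shared across groups of channels ("heads"). The class name, argument names, and defaults are illustrative assumptions, not the paper's fairseq implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightConv1d(nn.Module):
    """Minimal sketch of a lightweight convolution (illustrative names, not
    the paper's fairseq API): a depthwise 1D convolution whose kernels are
    softmax-normalized over the kernel width and shared across channel heads."""

    def __init__(self, channels: int, kernel_size: int = 3, num_heads: int = 8):
        super().__init__()
        assert channels % num_heads == 0
        self.channels = channels
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # One kernel per head rather than per channel: only H * K weights in total.
        self.weight = nn.Parameter(torch.randn(num_heads, 1, kernel_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); assumes an odd kernel_size for same-length output.
        B, C, T = x.shape
        H, K = self.num_heads, self.kernel_size
        # Softmax over the kernel width normalizes each head's kernel.
        weight = F.softmax(self.weight, dim=-1)                # (H, 1, K)
        # Repeat each head's kernel for the C // H channels it covers -> (C, 1, K).
        weight = weight.repeat_interleave(C // H, dim=0)
        # groups=C makes the convolution depthwise: one kernel per channel.
        return F.conv1d(x, weight, padding=K // 2, groups=C)

# Usage: same output shape as the input.
# y = LightweightConv1d(channels=64)(torch.randn(2, 64, 10))
```

The weight sharing across heads is what makes the layer "lightweight": each layer learns only H * K kernel weights instead of a separate kernel per channel.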

Contributions and Distinctions from Previous Works

  • the ability of self-attention to model long-range dependencies has recently come into question (Tang et al., 2018), and the unlimited context size is computationally very challenging due to the quadratic complexity in the input length. Furthermore, in practice, long sequences require the introduction of hierarchies (Liu et al., 2018).


Methods

  • Dynamic convolutions build on lightweight convolutions by predicting a different convolution kernel at every time-step (see the sketch after this list)
  • similar to locally connected layers in the sense that the weights change at every position; the difference is that the weights are dynamically generated by the model rather than fixed after training
  • bears similarity to location-based attention, which does not access the context to determine attention weights; however, DynamicConv does not take the attention weights from the previous time-step into account
  • Similar to self-attention, DynamicConv changes the weights assigned to context elements over time. However, the weights of DynamicConv do not depend on the entire context; they are a function of the current time-step only. Self-attention requires a quadratic number of operations in the sentence length to compute attention weights, while the computation of dynamic kernels for DynamicConv scales linearly in the sequence length.
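
The following is a minimal PyTorch sketch of that idea, under the assumption of illustrative names and shapes (the paper's actual module additionally uses input/output projections, a GLU, and DropConnect): the kernel applied at each position is predicted by a linear layer from that position's features only, softmax-normalized over the kernel width, and shared across channel heads, so the cost of generating kernels is linear in the sequence length.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv1d(nn.Module):
    """Minimal sketch of a dynamic convolution (illustrative names, not the
    paper's fairseq API): per-position kernels are predicted from the current
    time-step, softmax-normalized, and shared across channel heads."""

    def __init__(self, channels: int, kernel_size: int = 3, num_heads: int = 8):
        super().__init__()
        assert channels % num_heads == 0 and kernel_size % 2 == 1
        self.channels = channels
        self.kernel_size = kernel_size
        self.num_heads = num_heads
        # Kernel generator: maps the current time-step's features to H * K weights.
        self.weight_proj = nn.Linear(channels, num_heads * kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels)
        B, T, C = x.shape
        H, K = self.num_heads, self.kernel_size
        R = C // H  # channels covered by each head

        # Predict one kernel per head at every position; linear in T.
        weight = self.weight_proj(x).view(B, T, H, K)
        weight = F.softmax(weight, dim=-1)

        # Gather the window of K neighbouring time-steps around each position.
        pad = K // 2
        x_pad = F.pad(x, (0, 0, pad, pad))          # (B, T + 2*pad, C)
        windows = x_pad.unfold(1, K, 1)             # (B, T, C, K)
        windows = windows.reshape(B, T, H, R, K)

        # Weighted sum over each window with its position-specific kernel.
        out = torch.einsum('bthk,bthrk->bthr', weight, windows)
        return out.reshape(B, T, C)

# Usage: same output shape as the input.
# y = DynamicConv1d(channels=64)(torch.randn(2, 10, 64))
```
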

Results

  • lightweight convolutions generalize better than regular convolutions and can be competitive with state-of-the-art self-attention models
  • experiments on large-scale machine translation, language modeling, and abstractive summarization show that dynamic convolutions improve over strong self-attention models
  • DynamicConv achieves a 20% faster runtime than a highly optimized self-attention baseline
