Description
Paper
Link: https://arxiv.org/pdf/1901.10430.pdf
Year: 2019
Summary
- introduces dynamic convolutions, which are simpler and more efficient than self-attention
- a very lightweight convolution can perform competitively with the best reported self-attention results (see the sketch after this list)
- the number of operations required by this approach scales linearly in the input length, whereas self-attention scales quadratically
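To make the lightweight-convolution idea concrete, here is a minimal PyTorch-style sketch (the function name `lightweight_conv`, the tensor shapes, and the centered padding are illustrative assumptions, not the paper's fairseq implementation): a depthwise convolution whose kernel is softmax-normalized over its temporal dimension and shared across all channels within a head.

```python
import torch
import torch.nn.functional as F

def lightweight_conv(x, kernel_weights, num_heads):
    """Sketch of a lightweight convolution (assumed shapes, not the paper's code).

    x:              (batch, seq_len, channels)
    kernel_weights: (num_heads, kernel_size), shared by all channels in a head
    """
    B, T, C = x.shape
    H, K = kernel_weights.shape
    assert C % H == 0, "channels must be divisible by the number of heads"

    # softmax-normalize each head's kernel over the temporal (kernel) dimension
    w = torch.softmax(kernel_weights, dim=-1)            # (H, K)
    # every channel within a head reuses the same normalized kernel
    w = w.repeat_interleave(C // H, dim=0).unsqueeze(1)  # (C, 1, K)

    # depthwise (grouped) 1D convolution over the time axis
    x_t = x.transpose(1, 2)                              # (B, C, T)
    out = F.conv1d(x_t, w, padding=K // 2, groups=C)     # (B, C, T or T+1)
    return out[..., :T].transpose(1, 2)                  # (B, T, C)
```

Because the kernel has a fixed width K, each output position touches only K inputs, so the cost grows linearly with the sequence length; a causal (decoder-side) variant would left-pad the sequence by K-1 instead of using centered padding.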
Contributions and Distinctions from Previous Works
- the ability of self-attention to model long-range dependencies has recently come into question (Tang et al., 2018), and its unlimited context size is computationally very challenging due to the quadratic complexity in the input length; in practice, long sequences also require the introduction of hierarchies (Liu et al., 2018)
Methods
- Dynamic convolutions build on lightweight convolutions by predicting a different convolution kernel at every time-step (see the sketch after this list)
- similar to locally connected layers in the sense that the weights change at every position; the difference is that the weights are dynamically generated by the model rather than fixed after training
- bears similarity to location-based attention, which does not access the context to determine attention weights; however, DynamicConv does not directly take the attention weights from the previous time-step into account
- Similar to self-attention, DynamicConv changes the weights assigned to context elements over time. However, the weights of DynamicConv do not depend on the entire context; they are a function of the current time-step only. Self-attention requires a quadratic number of operations in the sentence length to compute attention weights, while computing the dynamic kernels of DynamicConv scales linearly in the sequence length.
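As a rough illustration of the per-time-step kernel prediction, here is a minimal PyTorch-style sketch (the class name `DynamicConvSketch`, the causal left-padding, and the explicit window gathering are simplifying assumptions, not the paper's fairseq implementation): a linear function of the current input predicts num_heads x kernel_size weights per position, which are softmax-normalized and applied to a fixed-width window of the input.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConvSketch(nn.Module):
    """Minimal sketch of a dynamic convolution (assumed shapes and names)."""

    def __init__(self, channels, num_heads, kernel_size):
        super().__init__()
        self.num_heads = num_heads
        self.kernel_size = kernel_size
        # a linear function of the current time-step predicts H * K kernel weights
        self.kernel_proj = nn.Linear(channels, num_heads * kernel_size)

    def forward(self, x):
        # x: (batch, seq_len, channels)
        B, T, C = x.shape
        H, K = self.num_heads, self.kernel_size

        # predict one kernel per head and time-step, softmax-normalized over K
        w = torch.softmax(self.kernel_proj(x).view(B, T, H, K), dim=-1)

        # gather a causal K-wide window of inputs for every position
        # (written with an explicit loop for clarity, not speed)
        x_heads = x.view(B, T, H, C // H)
        x_pad = F.pad(x_heads, (0, 0, 0, 0, K - 1, 0))     # left-pad the time axis
        windows = torch.stack(
            [x_pad[:, t:t + K] for t in range(T)], dim=1)  # (B, T, K, H, C//H)

        # weighted sum over each window: O(T * K) rather than O(T^2)
        out = torch.einsum('bthk,btkhd->bthd', w, windows)
        return out.reshape(B, T, C)
```

Since the kernel at position t is computed from the input at t alone and is applied to at most K neighbouring inputs, the number of operations grows linearly in the sequence length, in contrast to the quadratic attention-weight computation of self-attention.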
Results
- lightweight convolutions generalize better than regular convolutions and are competitive with state-of-the-art self-attention models
- experiments on large-scale machine translation, language modeling, and abstractive summarization show that dynamic convolutions improve over strong self-attention models
- dynamic convolutions achieve a 20% faster runtime than a highly optimized self-attention baseline