Description
Paper: Music Transformer
Link: https://openreview.net/forum?id=rJe4ShAcF7
Year: 2018
Summary
- relative attention is very well-suited for generative modeling of symbolic music
- the memory-efficient formulation could make it feasible to apply relative attention to much longer sequences, such as long texts or even audio waveforms
Contributions and Distinctions from Previous Works
- prior Transformers with relative attention could not scale to long sequences such as full musical compositions because of the memory cost of relative position representations; this work makes relative attention practical at such lengths
Methods
- take a language-modeling approach to training generative models for symbolic music: music is represented as a sequence of discrete tokens, with the vocabulary determined by the dataset; datasets in different genres call for different ways of serializing polyphonic music into a single stream and of discretizing time (a sketch of one such event vocabulary follows this list)
- perform "skewing" for a memory efficient implementation of relative position based attention
Results
- the memory-efficient relative self-attention mechanism dramatically reduces intermediate memory requirements, from O(L^2 D) to O(LD); for a sequence of length L = 2048 and hidden-state size D = 512, per-layer memory consumption drops from 8.5 GB to 4.2 MB (per head: from 1.1 GB to 0.52 MB); a quick arithmetic check follows this list
- generated samples were perceived as more coherent than those of the baseline Transformer model
- the model generalizes, generating in a consistent fashion beyond the lengths it was trained on
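As a quick arithmetic check on the reported figures (a sketch assuming float32 intermediates and 8 attention heads, i.e. 64 dimensions per head; the head count is an assumption not stated in this note):

```python
L, D, HEADS, BYTES = 2048, 512, 8, 4   # float32; 8 heads is an assumption
d_head = D // HEADS

old_per_head = L * L * d_head * BYTES  # O(L^2 D): per-head (L, L, d_head) relative-embedding tensor
new_per_head = L * d_head * BYTES      # O(L D): per-head (L, d_head) relative embeddings after skewing

print(f"per head:  {old_per_head / 1e9:.1f} GB -> {new_per_head / 1e6:.2f} MB")
print(f"per layer: {HEADS * old_per_head / 1e9:.1f} GB -> {HEADS * new_per_head / 1e6:.1f} MB")
# per head:  1.1 GB -> 0.52 MB;  per layer: 8.6 GB -> 4.2 MB (the note quotes 8.5 GB -> 4.2 MB)
```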