Description
Paper
Link: https://proceedings.neurips.cc/paper/2020/file/c8512d142a2d849725f31a9a7a361ab9-Paper.pdf
Year: 2020
Summary
- BigBird scales transformers to longer sequences by replacing the full quadratic attention mechanism with a combination of random attention, window (local) attention, and global attention (rough cost comparison sketched after this list)
- the sparser attention pattern allows processing of much longer sequences, translating to state-of-the-art experimental results on long-document tasks such as question answering and summarization
- comes with theoretical guarantees: the sparse attention mechanism is a universal approximator of sequence functions and is Turing complete
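A rough back-of-the-envelope comparison of attention cost (my own illustration with assumed per-token budgets, not the paper's exact hyperparameters), showing why the combined pattern grows roughly linearly in sequence length instead of quadratically:

```python
# Illustrative numbers only; window/random/global budgets are assumptions.
n = 4096                                  # sequence length
window, n_random, n_global = 3, 2, 1      # keys per query from each pattern

full_attention_pairs = n * n                               # quadratic in n
sparse_pairs = n * (window + n_random) + 2 * n_global * n  # roughly linear in n
print(full_attention_pairs, sparse_pairs)                  # 16777216 vs 28672
```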
Methods
- instead of full attention, they combine 3 types of attention (roughly Longformer's pattern plus random attention); see the mask sketch after this list
- random attention: each query attends to a few randomly chosen keys (random-graph intuition, similar to a random walk) - cheap, yet information still spreads to distant positions
- window attention: each token i attends to itself and its neighbors (i-1 and i+1), so information hops from neighbor to neighbor across layers (like Longformer's sliding window, similar in spirit to a convolution) - local context is important
- global attention: a few designated tokens (like the "CLS" token) attend to every position and every position attends to them, so any two tokens can communicate in two hops via a global token - they act as information hubs
- multiple layers are needed for information to propagate across the whole sequence, and a lot of engineering tricks (e.g., block-sparse computation) are needed to make it fast in practice
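A minimal sketch of how the three patterns could be combined into a single boolean attention mask (my own illustration, not the authors' block-sparse implementation; the names bigbird_mask, window_size, num_random, and global_idx are assumptions):

```python
import numpy as np

def bigbird_mask(seq_len, window_size=3, num_random=2, global_idx=(0,), seed=0):
    """Build a (seq_len, seq_len) boolean mask: mask[i, j] = True means
    query i may attend to key j. Sketch only; parameters are illustrative."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)

    half = window_size // 2
    for i in range(seq_len):
        # window attention: token i attends to itself and its neighbors
        mask[i, max(0, i - half):min(seq_len, i + half + 1)] = True
        # random attention: a few randomly chosen keys per query
        mask[i, rng.choice(seq_len, size=num_random, replace=False)] = True

    # global attention: designated tokens attend to, and are attended by, everything
    for g in global_idx:
        mask[g, :] = True
        mask[:, g] = True
    return mask

# usage: set disallowed attention scores to -inf before the softmax
mask = bigbird_mask(seq_len=16)
print(mask.sum(), "allowed pairs out of", mask.size)
```

In the paper the pattern is applied over blocks of tokens rather than individual positions, so the sparse computation maps onto dense accelerator kernels.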
Results
- performs well enough overall, though results are mixed across tasks
Comments
- relies on quite a lot of engineering tricks