Description
Paper
Link: https://arxiv.org/abs/2006.04862
Year: 2020
Summary
- this paper proves that sparse transformers are universal approximators: with only O(n) connections per attention layer, they can approximate any continuous sequence-to-sequence function as well as their dense counterparts
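As a rough statement of the guarantee (notation adapted from Yun et al.'s universal-approximation result for dense transformers, which this paper extends to sparse attention; the exact function class and conditions in the paper may differ from this paraphrase): for any continuous sequence-to-sequence function f on a compact domain, any 1 ≤ p < ∞, and any ε > 0, there exists a sparse transformer g such that

$$ d_p(f, g) := \Big( \int \lVert f(X) - g(X) \rVert_p^p \, dX \Big)^{1/p} \le \epsilon $$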
Results
- Except for the STAR and RANDOM patterns, networks with four sparse attention layers learn to copy the input sequences
- The STRIDED and STAR patterns achieve the best performance across all sparsity levels on language modeling
- The STRIDED and FIXED patterns with the UNION configuration show the best scores on the translation task
- In all tasks, the RANDOM pattern performs worse than the deterministic patterns, demonstrating the need for careful design of sparsity patterns. Overall, the experiments suggest that the optimal sparsity pattern is heavily task-dependent: for example, the STAR pattern performs best on language modeling but struggles with the copying, translation, and BERT experiments (the patterns are sketched as attention masks below)
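A minimal sketch of the four sparsity patterns as boolean attention masks, assuming STRIDED and FIXED follow Child et al.'s Sparse Transformer and STAR follows the Star-Transformer's hub-plus-ring layout; the causal masking, window sizes, and function names below are illustrative assumptions, not the paper's exact experimental setup.

```python
import numpy as np

def strided_mask(n, stride):
    # STRIDED: query i attends to the previous `stride` positions and to
    # every earlier position j with (i - j) % stride == 0 (causal).
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            if i - j < stride or (i - j) % stride == 0:
                mask[i, j] = True
    return mask

def fixed_mask(n, block):
    # FIXED: query i attends to its own block and to the last ("summary")
    # position of every block up to i (causal).
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1):
            if j // block == i // block or j % block == block - 1:
                mask[i, j] = True
    return mask

def star_mask(n):
    # STAR: every token attends to itself, its immediate neighbours, and a
    # single global hub token (position 0 here); the hub attends to everyone.
    mask = np.eye(n, dtype=bool)
    mask[:, 0] = True
    mask[0, :] = True
    for i in range(1, n):
        mask[i, i - 1] = True
        mask[i - 1, i] = True
    return mask

def random_mask(n, k, seed=0):
    # RANDOM: each query attends to itself plus k uniformly sampled positions.
    rng = np.random.default_rng(seed)
    mask = np.eye(n, dtype=bool)
    for i in range(n):
        mask[i, rng.choice(n, size=min(k, n), replace=False)] = True
    return mask
```

A mask is applied by setting the masked-out attention logits to -inf before the softmax. I read the UNION configuration above as OR-ing sub-pattern masks into a single head, but that reading is an assumption.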