Description
Paper
Link: https://arxiv.org/pdf/2007.04825.pdf
Year: 2020
Summary
- improves the computational complexity of self-attention by clustering the queries into non-overlapping clusters (see the complexity note below)
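As a rough complexity note (mine, not quoted from the paper): with sequence length $N$, $C$ clusters and feature dimensions $D_k, D_v$, clustering the queries replaces the quadratic cost of full softmax attention,

$\mathcal{O}(N^2 \max(D_k, D_v)) \;\longrightarrow\; \mathcal{O}(N C \max(D_k, D_v)),$

since only the $C$ centroids attend to all $N$ keys and the clustering step itself is linear in $N$.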
Methods
Differences from Reformer:
Recently, Kitaev et al. [14] introduced Reformer, a method that groups positions based on their similarity using locality-sensitive hashing (LSH) and only computes the attention within groups. For groups of fixed size, the asymptotic complexity of Reformer becomes linear with respect to the sequence length. Note that Reformer constrains the queries and keys of self-attention to be equal. As a result, it cannot be applied to neural machine translation, image captioning or memory networks, or generally any application with heterogeneous queries and keys. In addition, as it uses hash collisions to form groups, it can only handle a small number of bits, thus significantly reducing the quality of the grouping. Instead, our method uses clustering to group the queries, resulting in significantly better groups compared to hash collisions.
- Reformer is a sparse attention mechanism: queries/keys are grouped together and attention is computed only within groups. In contrast, clustered attention groups the queries and uses the cluster centroids to compute attention over all keys (dense attention) as an approximation of the true attention (see the first sketch after this list). We provide a bound on the quality of this approximation.
- Reformer sets the queries to be the same as the keys, i.e., $Q = K$. This is done to ensure that the grouped queries (and keys) have a high dot product (maximum inner product search), and it prohibits Reformer's use whenever the queries cannot be set equal to the keys. In contrast, clustered attention places no such constraint.
- We also propose improved-clustered attention, which explicitly recomputes the attention of each query on the top-k keys: the centroids are used to identify the small set of keys with the highest attention scores, and attention on those keys is then computed exactly (see the second sketch after this list). We prove that this always improves the attention approximation.
- Finally, we compare against Reformer and also show that clustered attention is backward compatible, i.e., we can use it with pretrained transformers without any retraining.
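A minimal NumPy sketch of the vanilla clustered-attention approximation described above. The function name `clustered_attention`, sklearn's `KMeans` and scipy's `softmax` are my illustration choices, not the paper's algorithm (which clusters LSH-hashed queries with K-means on Hamming distance), so treat this as a sketch of the approximation rather than the authors' exact procedure:

```python
# Sketch: queries are grouped into C clusters, each centroid attends densely to
# all keys, and every query in a cluster reuses its centroid's attention output.
import numpy as np
from scipy.special import softmax
from sklearn.cluster import KMeans


def clustered_attention(Q, K, V, n_clusters=16):
    """Approximate softmax(Q K^T / sqrt(d)) V using query-cluster centroids."""
    N, d = Q.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Q)
    centroids = np.stack([Q[labels == c].mean(axis=0) for c in range(n_clusters)])

    # C centroids attend over all N keys: O(C * N) dot products instead of O(N^2).
    attn = softmax(centroids @ K.T / np.sqrt(d), axis=-1)   # (C, N)
    centroid_out = attn @ V                                 # (C, d_v)

    # Each query reuses the output computed for its cluster's centroid.
    return centroid_out[labels]                             # (N, d_v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(512, 64)) for _ in range(3))
    exact = softmax(Q @ K.T / np.sqrt(64), axis=-1) @ V
    approx = clustered_attention(Q, K, V, n_clusters=64)
    print("mean abs error:", np.abs(exact - approx).mean())
```

The centroid's attention row is shared by every query in its cluster; that sharing is exactly the approximation the paper bounds.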
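And a sketch of my reading of the improved variant: the centroid distribution is only used to select the top-k keys per cluster; each query then attends to those keys exactly, with the exact part rescaled to the probability mass the centroid assigned to them, while the remaining keys keep the centroid probabilities. The exact renormalisation details are in the paper; the loop over clusters here is for clarity, not efficiency:

```python
import numpy as np
from scipy.special import softmax
from sklearn.cluster import KMeans


def improved_clustered_attention(Q, K, V, n_clusters=16, topk=32):
    N, d = Q.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Q)
    centroids = np.stack([Q[labels == c].mean(axis=0) for c in range(n_clusters)])

    # Centroid attention over all keys is used to pick, per cluster,
    # the top-k keys that receive the most attention mass.
    centroid_attn = softmax(centroids @ K.T / np.sqrt(d), axis=-1)   # (C, N)
    top = np.argsort(centroid_attn, axis=-1)[:, -topk:]              # (C, topk)

    out = np.empty((N, V.shape[1]))
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)
        t = top[c]
        mass = centroid_attn[c, t].sum()   # centroid prob. mass on the top-k keys
        # Exact attention of every member query on its cluster's top-k keys,
        # rescaled to the probability mass the centroid assigned to those keys.
        exact = softmax(Q[members] @ K[t].T / np.sqrt(d), axis=-1) * mass
        # All other keys keep the centroid's attention probabilities.
        probs = np.tile(centroid_attn[c], (len(members), 1))
        probs[:, t] = exact
        out[members] = probs @ V
    return out
```

On random data (as in the previous sketch), the top-k recomputation should only reduce the approximation error relative to plain clustered attention, which matches the improvement guarantee the paper proves.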
Results
- for short sequences, improved-clustered attention is slower than full attention, but it still uses less memory
- improved-clustered attention is a good approximation of full attention; we can get the best of both softmax and improved-clustered attention by using the attention composition available in fast-transformers (sketched below)
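A hypothetical sketch of that composition with the pytorch-fast-transformers builder. The composed `attention_type` string and the extra kwargs (`length_limit`, `clusters`, `topk`) follow my recollection of the library's docs and should be checked against the installed version:

```python
# Sketch (unverified kwargs): build an encoder that falls back to exact softmax
# attention for short sequences and uses improved-clustered attention otherwise,
# via the attention-composition feature of pytorch-fast-transformers.
import torch
from fast_transformers.builders import TransformerEncoderBuilder

builder = TransformerEncoderBuilder.from_kwargs(
    n_layers=4,
    n_heads=8,
    query_dimensions=64,
    value_dimensions=64,
    feed_forward_dimensions=1024,
    # "conditional-full" wraps the inner attention and switches to full softmax
    # attention when the sequence is shorter than length_limit (assumed name).
    attention_type="conditional-full:improved-clustered",
    length_limit=512,
    clusters=100,
    topk=32,
)
model = builder.get()

x = torch.randn(2, 1024, 8 * 64)   # (batch, sequence_length, n_heads * query_dimensions)
y = model(x)
print(y.shape)
```

With such a setup, short sequences would use exact softmax attention, which is consistent with the result above that full attention is faster below a certain length.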