This repo contains the code for three papers:
- Feedback Transformer
- Expire-Span
- Staircase Transformer
The training code is designed for modeling long sequences with Transformer-like architectures.
You will need a CUDA-enabled GPU to run the code.
To install the dependencies, run:
pip install -r requirements.txt
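As a quick sanity check that the GPU is visible before launching any of the experiment scripts, something like the following can be run (this assumes PyTorch is among the pinned requirements, which is not spelled out above):

```python
# Hypothetical sanity check, assuming PyTorch is installed via requirements.txt.
import torch

assert torch.cuda.is_available(), "No CUDA-enabled GPU detected"
print(torch.cuda.get_device_name(0))  # name of the first visible GPU
```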
The Feedback Transformer was introduced in *Addressing Some Limitations of Transformers with Feedback Memory*.

Results on enwik8:

Model | Params | Valid | Test |
---|---|---|---|
Feedback Transformer | 77M | 0.984 | 0.962 |

Numbers are bits-per-character.

To reproduce, run:
bash experiments/feedback/enwik8.sh
Results on the algorithmic task:

Model | 3 Variables | 5 Variables |
---|---|---|
Transformer | 33.7 | 37.5 |
Feedback Transformer | 99.1 | 92.6 |

Numbers are % accuracy on the test set.

To reproduce, run:
bash experiments/feedback/algorithmic_3var.sh
bash experiments/feedback/algorithmic_5var.sh
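For intuition, the core idea of the Feedback Transformer is that every layer attends to a single shared memory in which each past timestep is stored as a learned weighted sum of all of that step's layer outputs, including the topmost. The sketch below is only illustrative and is not the implementation in this repo; the module names, shapes, and the use of `nn.MultiheadAttention` are assumptions, and feed-forward sublayers, layer norm, positional information, and memory truncation are omitted.

```python
# Illustrative sketch of feedback memory; NOT this repo's implementation.
import torch
import torch.nn as nn


class FeedbackMemoryStep(nn.Module):
    """One decoding step of a feedback-memory model: every layer attends to the
    same shared memory of past timesteps, and the new memory slot is a learned
    weighted sum of all layer outputs at the current step (including the top)."""

    def __init__(self, dim: int, n_layers: int, n_heads: int):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, n_heads, batch_first=True) for _ in range(n_layers)]
        )
        # Mixing weights over the input embedding plus every layer output.
        self.layer_weights = nn.Parameter(torch.zeros(n_layers + 1))

    def forward(self, x_t, memory=None):
        # x_t: (batch, 1, dim) current-step input; memory: (batch, t, dim) or None.
        states = [x_t]
        h = x_t
        for attn in self.layers:
            kv = h if memory is None else torch.cat([memory, h], dim=1)
            h, _ = attn(h, kv, kv)  # every layer reads the SAME shared memory
            states.append(h)
        # New memory slot: softmax-weighted sum over all per-layer states.
        w = torch.softmax(self.layer_weights, dim=0)
        slot = sum(wi * si for wi, si in zip(w, states))
        memory = slot if memory is None else torch.cat([memory, slot], dim=1)
        return h, memory


# Usage (schematic): step through a sequence one position at a time.
# model = FeedbackMemoryStep(dim=512, n_layers=4, n_heads=8)
# memory = None
# for x_t in step_inputs:              # each x_t: (batch, 1, dim)
#     h, memory = model(x_t, memory)
```

Because the memory slot written at step t depends on that step's topmost layer, computation is sequential over time; that is the trade-off the paper accepts in exchange for exposing high-level representations to lower layers at future steps.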
Expire-Span was introduced in *Not All Memories are Created Equal: Learning to Expire*.

Results on enwik8:

Model | Params | Valid | Test |
---|---|---|---|
Expire-Span 12L | 38M | 1.014 | 0.994 |

Numbers are bits-per-character.

To reproduce, run:
bash experiments/expire_span/enwik8.sh
Results on the Object Collision task:

Model | Maximum Span | Test Error (%) |
---|---|---|
Expire-Span | 16k | 52.2 |
Expire-Span | 32k | 36.7 |
Expire-Span | 64k | 26.7 |

To reproduce, run:
bash experiments/expire_span/object_collision_16k.sh
bash experiments/expire_span/object_collision_32k.sh
bash experiments/expire_span/object_collision_64k.sh
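For intuition, Expire-Span predicts, for every memory it stores, how long that memory should be kept; once a memory's age exceeds its predicted span it is masked out (and can be dropped entirely), with a linear ramp keeping the mask differentiable. The sketch below is only illustrative and is not the repo's implementation; names, shapes, and the placement of the auxiliary span penalty are assumptions.

```python
# Illustrative sketch of Expire-Span masking; NOT this repo's implementation.
import torch
import torch.nn as nn


class ExpireSpanMask(nn.Module):
    """Predicts an expiration span for each memory and converts the memory's
    age into a soft multiplicative mask on the attention weights."""

    def __init__(self, dim: int, max_span: int = 1024, ramp: int = 128):
        super().__init__()
        self.span_predictor = nn.Linear(dim, 1)
        self.max_span = max_span
        self.ramp = ramp  # soft ramp length keeps the mask differentiable

    def forward(self, memory_h, ages):
        # memory_h: (batch, mem_len, dim) hidden states stored as memory
        # ages: (batch, mem_len) number of steps since each memory was written
        spans = self.max_span * torch.sigmoid(self.span_predictor(memory_h)).squeeze(-1)
        # Mask is 1 while age < span, then decays linearly to 0 over `ramp` steps.
        mask = torch.clamp((spans - ages) / self.ramp, min=0.0, max=1.0)
        # Auxiliary penalty encourages short spans so expired memories can be freed.
        span_loss = spans.mean() / self.max_span
        return mask, span_loss


# Usage inside attention (schematic): multiply the softmaxed attention weights
# by `mask` for the memory positions and renormalize; memories whose mask has
# reached 0 can be dropped from the cache entirely.
```

Dropping fully expired memories is what allows the very long maximum spans (16k-64k) in the table above without attending over every past step.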
The Staircase Transformer was introduced in *Staircase Attention for Recurrent Processing of Sequences*. Note that the algorithmic task in this repo differs slightly from the one used in the paper, so the numbers may not match exactly, but they show the same trend as in the paper; the model implementation and hyperparameters remain the same.
Results on the algorithmic task:

Model | Test Error |
---|---|
Transformer | 58.44% |
Staircase Transformer | 3.6% |

Numbers are error rate (%) on the test set.

To reproduce, run:
bash experiments/staircase/algorithmic_3var.sh
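For intuition, the Staircase Transformer processes the input in chunks and repeatedly re-reads its own outputs: at each recurrent step a new chunk enters while the representations of recent chunks are fed back in, so every chunk is refined over several passes before it leaves the window. The sketch below is only illustrative and is not the repo's implementation; the chunking scheme, function name, and parameters are assumptions, and causal masking and positional handling are omitted.

```python
# Illustrative sketch of staircase-style recurrent chunk processing; NOT this
# repo's implementation.
import torch
import torch.nn as nn


def staircase_forward(core: nn.Module, x: torch.Tensor,
                      chunk_size: int, window_chunks: int) -> torch.Tensor:
    """Feed the sequence chunk by chunk. At each recurrent step the shared
    `core` re-reads its own outputs for the chunks still in the window together
    with the newly arrived chunk, so every chunk is refined `window_chunks`
    times before it leaves the staircase."""
    # x: (batch, seq_len, dim) with seq_len divisible by chunk_size.
    chunks = list(x.split(chunk_size, dim=1))
    carry = []       # representations of chunks still inside the window
    finished = []    # final representation of each chunk, in order
    for step in range(len(chunks) + window_chunks - 1):
        new = [chunks[step]] if step < len(chunks) else []   # drain at the end
        out = list(core(torch.cat(carry + new, dim=1)).split(chunk_size, dim=1))
        if len(out) == window_chunks or step >= len(chunks):
            finished.append(out.pop(0))                      # oldest chunk exits
        carry = out
    return torch.cat(finished, dim=1)


# Usage (schematic), with a small Transformer encoder as the shared core:
# layer = nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True)
# core = nn.TransformerEncoder(layer, num_layers=2)
# y = staircase_forward(core, torch.randn(2, 64, 128), chunk_size=16, window_chunks=3)
```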
The code is licensed under the CC-BY-NC license. See the LICENSE file for more details.