We follow the official code base of [fairseq] and implement Flowformer on top of that repo.
Since fairseq is quite a large code base, we only provide the changed module and our experimental configuration. You can incorporate flow_attention.py into fairseq for reproduction.
Figure 1. Results on Wikitext-103.
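For orientation, the sketch below shows the core Flow-Attention computation in a simplified, non-causal, single-head form. It is illustrative only: the provided flow_attention.py follows fairseq's MultiheadAttention interface and implements the causal variant required for language modeling, so the function name, shapes, and details here are assumptions rather than the exact module we ship.

```python
import torch

def flow_attention_sketch(q, k, v, eps=1e-6):
    """Simplified, non-causal Flow-Attention sketch (illustrative only).

    q, k, v: (batch, length, dim) tensors for a single head.
    """
    # Non-negative feature map so that attention weights can be read as flows.
    q, k = torch.sigmoid(q), torch.sigmoid(k)
    # Incoming flow of each sink (query position) and outgoing flow of each source (key position).
    incoming = 1.0 / (q @ k.sum(dim=1).unsqueeze(-1) + eps)            # (B, Lq, 1)
    outgoing = 1.0 / (k @ q.sum(dim=1).unsqueeze(-1) + eps)            # (B, Lk, 1)
    # Flow conservation: recompute each side's flow after normalizing the other side.
    conserved_sink = q @ (k * outgoing).sum(dim=1).unsqueeze(-1)       # (B, Lq, 1)
    conserved_source = k @ (q * incoming).sum(dim=1).unsqueeze(-1)     # (B, Lk, 1)
    # Competition among sources (softmax) and allocation to sinks (sigmoid).
    competition = torch.softmax(conserved_source, dim=1) * k.shape[1]  # (B, Lk, 1)
    allocation = torch.sigmoid(conserved_sink)                         # (B, Lq, 1)
    # Linear-complexity aggregation through the (dim x dim) key-value summary.
    kv = k.transpose(1, 2) @ (v * competition)                         # (B, D, D)
    return (q @ kv) * incoming * allocation                            # (B, Lq, D)
```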
- Prepare the environment and download the dataset following the tutorial of [Language Modeling].
- Replace ./fairseq/modules/multihead_attention.py with our provided flow_attention.py.
- Train and evaluate the model with the following scripts. You can get the pretrained model from [here].
```bash
fairseq-train --task language_modeling \
    data-bin/wikitext-103 \
    --save-dir checkpoints/flowformer \
    --arch transformer_lm --share-decoder-input-output-embed \
    --dropout 0.1 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 6000 --warmup-init-lr 1e-07 \
    --tokens-per-sample 512 --sample-break-mode none \
    --max-tokens 2048 --update-freq 16 \
    --max-update 150000
```
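Note that `--update-freq 16` accumulates gradients over 16 forward/backward passes, so each update sees roughly 2048 × 16 = 32,768 tokens per GPU; if you train on multiple GPUs, you may want to scale `--update-freq` down to keep a comparable effective batch size.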
```bash
fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/flowformer/checkpoint_best.pt \
    --batch-size 2 \
    --tokens-per-sample 512 \
    --context-window 400
```
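Here `--context-window 400` gives each evaluated block additional preceding context that is not itself scored, following the usual fairseq language-model evaluation setup.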
Our code base is built upon the official code of fairseq: