This repository contains the source code, pre-trained models, as well as instructions to reproduce results for our paper Time-aware Large Kernel Convolutions (ICML 2020).
TaLK Convolutions is a sequence modeling method that uses an adaptive convolution operation that learns to predict the size of a summation kernel instead of using a fixed-sized learnable kernel matrix. It utilizes a fast parallelized implementation of the summed-area table, also known as the integral image operation, to efficiently calculate the convolution output that uses the summation kernel. We generate relative offsets for each timestep of the input sequence, which are used to adaptively expand the size of the summation kernel conditioned on the input. This method yields a time complexity of O(n), effectively making the sequence encoding process linear to the number of tokens.
@inproceedings{lioutas2020timeaware,
author={Vasileios Lioutas and Yuhong Guo},
title={Time-aware Large Kernel Convolutions},
booktitle={Proceedings of the 37th International Conference on Machine Learning (ICML)},
year={2020}
}
- PyTorch version >= 1.3.1
- fairseq version >= 0.10.1
- Python version >= 3.6
- CUDA >= 10.1
- NVIDIA's apex library (for mixed-precision training)
git clone https://github.com/lioutasb/TaLKConvolutions.git
cd TaLKConvolutions
In order to support the parallelization of TaLK Convolutions, we have developed our own CUDA primitives. To install the kernels, use the commands below. We tested compiling the kernels using CUDA 10.1
but if a future CUDA release does not work, please feel free to open an issue.
cd talkconv/talkconv_module/
python setup.py install
We are welcoming contributions from experienced CUDA developers regarding making the CUDA kernels more efficient.
Dataset | Model | Prepared test set |
---|---|---|
IWSLT14 German-English | download (.pt) | IWSLT14 test: download (.zip) |
WMT16 English-German | download (.pt) | newstest2014: download (.zip) |
WMT14 English-French | download (.pt) | newstest2014: download (.zip) |
Please follow the instructions https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md
to preprocess the data.
Training and evaluating TaLK Convolutions on a single GPU:
# Training
SAVE="checkpoints/talkconv_iwslt_deen"
mkdir -p $SAVE
CUDA_VISIBLE_DEVICES=0 \
fairseq-train data-bin/iwslt14.tokenized.de-en \
--user-dir talkconv/talkconv_fairseq \
--arch talkconv_iwslt_de_en \
--optimizer adam --fp16 --lr 0.0005 \
--source-lang de --target-lang en --max-tokens 4000 \
--min-lr '1e-09' --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--lr-scheduler inverse_sqrt \
--dropout 0.3 --attention-dropout 0.1 --weight-dropout 0.1 \
--max-update 85000 --warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9, 0.98)' --left-pad-source "False" --max-epoch 52 --seed 1024 \
--save-dir $SAVE
python utils/average_checkpoints.py --inputs $SAVE \
--num-epoch-checkpoints 10 --output "${SAVE}/model.pt"
# Evaluation
fairseq-generate data-bin/iwslt14.tokenized.de-en --user-dir talkconv/talkconv_fairseq \
--path "${SAVE}/model.pt" \
--batch-size 128 --beam 5 --remove-bpe --lenpen 1.6 --gen-subset test --quiet
Training and evaluating TaLK Convolutions on WMT16 En-De using cosine scheduler on one machine with 8 NVIDIA GPUs:
# Training
SAVE="checkpoints/talkconv_wmt_ende_big"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 fairseq-train \
data-bin/wmt16_en_de_bpe32k --fp16 --log-interval 100 --no-progress-bar --distributed-no-spawn \
--user-dir talkconv/talkconv_fairseq \
--max-update 30243 --share-all-embeddings --optimizer adam \
--adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--min-lr 1e-09 --update-freq 16 \
--ddp-backend=no_c10d --max-tokens 3584 \
--lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
--lr-shrink 1 --max-lr 0.001 --lr 1e-7 --min-lr 1e-9 --warmup-init-lr 1e-07 \
--t-mult 1 --lr-period-updates 20000 \
--arch talkconv_wmt_en_de_big \
--save-dir $SAVE
# Checkpoint averaging
python utilss/average_checkpoints.py --inputs $SAVE \
--num-epoch-checkpoints 10 --output "${SAVE}/model.pt"
# Evaluation on newstest2014
CUDA_VISIBLE_DEVICES=0 \
fairseq-generate data-bin/wmt16_en_de_bpe32k --user-dir talkconv/talkconv_fairseq \
--path "${SAVE}/model.pt" \
--batch-size 128 --beam 4 --remove-bpe --lenpen 0.35 --gen-subset test > wmt14_gen_ende.txt
bash utils/compound_split_bleu.sh wmt14_gen_ende.txt
Training and evaluating TaLK Convolutions on WMT14 En-Fr using cosine scheduler on one machine with 8 NVIDIA GPUs:
# Training
SAVE="checkpoints/talkconv_wmt_enfr_big"
mkdir -p $SAVE
python -m torch.distributed.launch --nproc_per_node 8 fairseq-train \
data-bin/wmt14_en_fr --fp16 --log-interval 100 --no-progress-bar --distributed-no-spawn \
--user-dir talkconv/talkconv_fairseq \
--max-update 80000 --share-all-embeddings --optimizer adam \
--adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--min-lr 1e-09 --update-freq 32 \
--ddp-backend=no_c10d --max-tokens 1800 \
--lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
--lr-shrink 1 --max-lr 0.001 --lr 1e-7 --min-lr 1e-9 --warmup-init-lr 1e-07 \
--t-mult 1 --lr-period-updates 70000 \
--arch talkconv_wmt_en_fr_big \
--save-dir $SAVE
# Checkpoint averaging
python utils/average_checkpoints.py --inputs $SAVE \
--num-epoch-checkpoints 10 --output "${SAVE}/model.pt"
# Evaluation
CUDA_VISIBLE_DEVICES=0 \
fairseq-generate data-bin/wmt14_en_fr --user-dir talkconv/talkconv_fairseq \
--path "${SAVE}/model.pt" \
--batch-size 128 --beam 6 --remove-bpe --lenpen 0.65 --gen-subset test --quiet
This project is MIT-licensed. The license applies to the pre-trained models as well.