Language generation models often differ in configuration, such as vocabulary, tokenization, and generation order, so they cannot simply be ensembled. Twist decoding combines such models at inference time, regardless of these differences and without any additional training or finetuning.
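For intuition, here is a minimal sketch of the idea, not the repository's implementation: the two models `f` and `g` take turns decoding, and each pass regularizes beam search toward the other model's latest output. `beam_search` is a hypothetical callable (standard beam search plus a distance term to a guide sequence, weighted by `lmd`); the exact update schedule in `twist/generate_twist.py` may differ.

```python
# Minimal, illustrative sketch of Twist decoding (NOT the repository's
# implementation; see twist/generate_twist.py for the real one).
# `beam_search` is a hypothetical callable: standard beam search whose
# hypothesis scores are regularized by a distance to `guide`, weighted by
# `lmd` (cf. the --lmd-f / --lmd-g options in the commands below).
def twist_decode(beam_search, f, g, src, max_updates, lmd_f, lmd_g):
    y = beam_search(f, src, guide=None, lmd=0.0)     # initial pass with f alone
    for _ in range(max_updates):
        y = beam_search(g, src, guide=y, lmd=lmd_g)  # g decodes toward f's output
        y = beam_search(f, src, guide=y, lmd=lmd_f)  # f decodes toward g's output
    return y
```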
We forked the fairseq library and incorporated the distance terms into its beam search implementation. Twist decoding can be incorporated into any implementation of beam search, but here we provide the codebase that we used for our paper. To run experiments, follow the fairseq instructions and run the following in this repository:
cd fairseq
pip install --editable .
python setup.py build_ext --inplace
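To sanity-check the installation (optional; this assumes the editable install above succeeded), you can print the installed fairseq version:
python -c "import fairseq; print(fairseq.__version__)"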
Any fairseq sequence-to-sequence model should work, but here we provide all the models we used in our experiments. See our paper for the training details.
| Models | | | | |
|---|---|---|---|---|
| DE-EN Generic¹ | DE-EN Medicine | DE-EN Law | DE-EN Koran | DE-EN Subtitles |
| ZH-EN L2R | ZH-EN R2L | EN-DE L2R | EN-DE R2L | |
| SciTLDR Abstract² | SciTLDR AIC² | | | |

¹: The WMT19 top-performing model. Downloaded from the fairseq repository.
²: Downloaded from the official repository of the SciTLDR dataset (Cachola et al., 2020).
| Datasets | | | | | |
|---|---|---|---|---|---|
| DE-EN Medicine³ | DE-EN Law³ | DE-EN Koran³ | DE-EN Subtitles³ | WMT20 ZH-EN⁴ | WMT20 EN-DE⁴ |

³: Downloaded from the official repository of Hu et al. (2019).
⁴: Downloaded from the official repository of the bidimensional leaderboards (Kasai et al., 2022).
Here are some example commands for machine translation. Run Twist decoding with `f`=Domain and `g`=Generic in the medical domain; the two models are separated by a colon in the options (`f:g`). Run Moses detokenization afterwards.
cd fairseq/
python twist/generate_twist.py --model-dirs <PATH>/trans-base_medicine-de-en/:<PATH>/wmt19.de-en.joined-dict/ --model-names model.pt:model.pt --out-file mt/domains/medicine/output/test.twist --r2l 0:0 --src-lang de --tgt-lang en --in-file mt/domains/medicine/src/emea-test.tok.de --batch-size 20 --max-updates 3 --lmd-g 0.3 --lmd-f 0.1
perl <PATH>/mosesdecoder/scripts/tokenizer/detokenizer.perl -l en < mt/domains/medicine/output/test.twist_update-2.out > mt/domains/medicine/output/test.twist_update-2.txt
Run Twist decoding with `f`=Generic and `g`=Domain in the legal domain.
python twist/generate_twist.py --model-dirs <PATH>/wmt19.de-en.joined-dict/:<PATH>/trans-base_law-de-en/ --model-names model.pt:model.pt --out-file mt/domains/law/output/test.twist --r2l 0:0 --src-lang de --tgt-lang en --in-file mt/domains/law/src/acquis-test.tok.de --batch-size 20 --max-updates 3 --lmd-g 3.0 --lmd-f 0.1
Run the reranking baseline.
python twist/generate_rerank.py --model-dirs <PATH>/trans-base_medicine-de-en/:<PATH>/wmt19.de-en.joined-dict/ --model-names model.pt:model.pt --out-file mt/domains/medicine/output/test.rerank.out --r2l 0:0 --src-lang de --tgt-lang en --in-file mt/domains/medicine/src/emea-test.tok.de --batch-size 20
Run Twist decoding with an R2L (right-to-left) `f` and an L2R (left-to-right) `g` on WMT20 ZH-EN. The command is similar, but we pass the `--r2l` option to mark which models decode right to left (`1:0` here marks `f` as R2L).
python twist/generate_twist.py --model-dirs <PATH>/trans-large-r2l_wmt20-zh-en/:<PATH>/trans-large-l2r_wmt20-zh-en/ --model-names model.pt:model.pt --out-file mt/wmt/zh-en/output/test.twist --r2l 1:0 --src-lang zh --tgt-lang en --in-file mt/wmt/zh-en/src/newstest2020.zh-en.src.tok.zh --max-updates 3 --lmd-g 3.0 --lmd-f 0.1 --batch-size 20
Here are some example commands for summarization on SciTLDR. Run Twist decoding with `f`=AIC (abstract, introduction, and conclusion) and `g`=Abstract.
python twist/generate_twist_tldr.py --checkpoint-dirs <PATH>/scitldr_catts-xsum.tldr-aic/:<PATH>/scitldr_bart.tldr-ao/ --data-dirs summ/scitldr/SciTLDR-AIC/ctrl:summ/scitldr/SciTLDR-A/ctrl --checkpoint-files scitldr_catts-xsum.tldr-aic.pt:scitldr_bart.tldr-ao.pt --max-updates 3 --batch-size 1 --split test --beam 5 --lmd-g 3.0 --lmd-f 0.3 --out-file summ/scitldr/output/test.twist
Run the reranking baseline.
python twist/generate_rerank_tldr.py --checkpoint-dirs <PATH>/scitldr_catts-xsum.tldr-aic/:<PATH>/scitldr_bart.tldr-ao --data-dirs summ/scitldr/SciTLDR-AIC/ctrl:summ/scitldr/SciTLDR-A/ctrl --checkpoint-files scitldr_catts-xsum.tldr-aic.pt:scitldr_bart.tldr-ao.pt --batch-size 1 --split test --beam 5 --out-file summ/scitldr/output/test.rerank.txt
Lastly, we provide evaluation tools: COMET for machine translation and ROUGE for summarization. Use the sacrebleu library to measure BLEU scores. For example:
cd eval/COMET/
bash run.sh ../../fairseq/mt/domains/medicine/src/emea-test.de ../../fairseq/mt/domains/medicine/output/test.twist_update-2.txt ../../fairseq/mt/domains/medicine/tgt/emea-test.en.jsonl ../../fairseq/mt/domains/medicine/output/test.twist_update-2.comet
cd fairseq/
sacrebleu mt/domains/medicine/tgt/emea-test.en -i mt/domains/medicine/output/test.twist_update-2.txt -m bleu -b -w 4 -l de-en
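The same BLEU score can also be computed through the sacrebleu Python API. A minimal sketch, mirroring the CLI call above (the file paths are taken from that example):

```python
# Compute BLEU with the sacrebleu Python API, mirroring the CLI call above.
import sacrebleu

with open("mt/domains/medicine/output/test.twist_update-2.txt") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("mt/domains/medicine/tgt/emea-test.en") as f:
    refs = [line.rstrip("\n") for line in f]

# corpus_bleu expects the hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(round(bleu.score, 4))
```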
cd eval/ROUGE/
bash run.sh ../../fairseq/summ/scitldr/output/test.twist_update-2.txt ../../fairseq/summ/scitldr/output/test.twist_update-2.txt ../../fairseq/summ/scitldr/tgt/test_refs.jsonl ../../fairseq/summ/scitldr/output/test.twist_update-2.rougeL rougeL
If you use this codebase in your research, please cite our paper:

@misc{kasai2022twist,
author = {Jungo Kasai and
Keisuke Sakaguchi and
Ronan Le Bras and
Hao Peng and
Ximing Lu and
Dragomir Radev and
Yejin Choi and
Noah A. Smith},
title = {Twist Decoding: Diverse Generators Guide Each Other},
year = {2022},
url = {https://arxiv.org/abs/2205.09273},
}