open source mbart (facebookresearch#1033)
Summary:
Pull Request resolved: fairinternal/fairseq-py#1033
Differential Revision: D20122520
Pulled By: yinhanliu
fbshipit-source-id: e2fd93e2fa9b7a8e276acc4316a176ba3ceae4ed
1 parent f8b795f · commit 5e79322 · 10 changed files with 461 additions and 10 deletions.
# MBART: Multilingual Denoising Pre-training for Neural Machine Translation

[https://arxiv.org/abs/2001.08210]

## Introduction

mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora in many languages using the BART objective. It is one of the first methods to pre-train a complete sequence-to-sequence model by denoising full texts in multiple languages, whereas previous approaches focused only on the encoder, only on the decoder, or on reconstructing parts of the text.
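
The pre-training objective can be pictured as follows: the encoder reads a corrupted sentence tagged with its language id, and the decoder reconstructs the original text. The snippet below is only a toy illustration of that idea; the actual noising function, span lengths, and sentence permutation used in the paper are more involved.

```python
# Toy illustration of the denoising objective (not the paper's exact noising).
import random

def span_mask(tokens, mask_ratio=0.35, mask_token="<mask>"):
    """Replace one random contiguous span of tokens with a single <mask>."""
    n = max(1, int(len(tokens) * mask_ratio))
    start = random.randrange(0, len(tokens) - n + 1)
    return tokens[:start] + [mask_token] + tokens[start + n:]

original = "UN Chief Says There Is No Military Solution in Syria".split()
noised = span_mask(original)

# One training pair: corrupted input (with a language tag) -> original text.
source = " ".join(noised) + " [en_XX]"
target = " ".join(original)
print(source)
print(target)
```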

## Pre-trained models

Model | Description | # params | Download
---|---|---|---
`mbart.CC25` | mBART model with 12 encoder and 12 decoder layers, trained on the monolingual corpora of 25 languages | 610M | [mbart.CC25.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz)
`mbart.ft.ro_en` | mBART CC25 model fine-tuned on the WMT16 EN-RO translation data | 610M | [mbart.cc25.ft.enro.tar.gz](https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.ft.enro.tar.gz)
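
As a quick sanity check after downloading, the checkpoint can be inspected directly, since fairseq checkpoints are ordinary torch pickles; the extracted path below is an assumption about the tarball layout, not part of the official instructions.

```python
# Hedged sketch: inspect the downloaded checkpoint before fine-tuning.
import torch

ckpt = torch.load("mbart.cc25/model.pt", map_location="cpu")  # path assumed
print(list(ckpt.keys()))  # fairseq checkpoints typically include 'model' and 'args'
print(sum(p.numel() for p in ckpt["model"].values()))  # roughly 610M parameters
```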

## Results

**[WMT16 EN-RO](https://www.statmt.org/wmt16/translation-task.html)**

_(test set, no additional data used)_

Model | en-ro | ro-en
---|---|---
`Random` | 34.3 | 34.0
`mbart.cc25` | 37.7 | 37.8
`mbart.enro.bilingual` | 38.5 | 38.5

## BPE data

Download the pre-trained model:

```bash
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.CC25.tar.gz
tar -xzvf mbart.CC25.tar.gz
```

Install SentencePiece (SPM) from [here](https://github.com/google/sentencepiece), then apply it to the raw data:
```bash
SPM=/path/to/sentencepiece/build/src/spm_encode
MODEL=sentence.bpe.model  # shipped with the downloaded pre-trained model
# ${DATA}, ${TRAIN}, ${VALID}, ${TEST}, ${SRC} and ${TGT} must point at your raw
# parallel text; example settings are sketched after this block.
${SPM} --model=${MODEL} < ${DATA}/${TRAIN}.${SRC} > ${DATA}/${TRAIN}.spm.${SRC} &
${SPM} --model=${MODEL} < ${DATA}/${TRAIN}.${TGT} > ${DATA}/${TRAIN}.spm.${TGT} &
${SPM} --model=${MODEL} < ${DATA}/${VALID}.${SRC} > ${DATA}/${VALID}.spm.${SRC} &
${SPM} --model=${MODEL} < ${DATA}/${VALID}.${TGT} > ${DATA}/${VALID}.spm.${TGT} &
${SPM} --model=${MODEL} < ${DATA}/${TEST}.${SRC} > ${DATA}/${TEST}.spm.${SRC} &
${SPM} --model=${MODEL} < ${DATA}/${TEST}.${TGT} > ${DATA}/${TEST}.spm.${TGT} &
```
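
The variables above are not defined in this README; for WMT16 EN-RO they might be set along these lines (all values here are assumptions about your local layout):

```bash
# Hypothetical settings; adjust paths and file names to your own data.
DATA=/path/to/wmt16_en_ro      # directory containing the raw parallel text
TRAIN=train
VALID=valid
TEST=test
SRC=en_XX                      # mBART language codes, matching the commands below
TGT=ro_RO
```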

## Preprocess data

```bash
DICT=dict.txt  # dictionary shipped with the downloaded mbart.CC25 model
python preprocess.py \
  --source-lang ${SRC} \
  --target-lang ${TGT} \
  --trainpref ${DATA}/${TRAIN}.spm \
  --validpref ${DATA}/${VALID}.spm \
  --testpref ${DATA}/${TEST}.spm \
  --destdir ${DEST}/${NAME} \
  --thresholdtgt 0 \
  --thresholdsrc 0 \
  --srcdict ${DICT} \
  --tgtdict ${DICT} \
  --workers 70
```
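
If preprocessing succeeds, `${DEST}/${NAME}` should contain the binarized splits. The listing below assumes fairseq's usual output naming with the language codes used above:

```bash
ls ${DEST}/${NAME}
# dict.en_XX.txt   dict.ro_RO.txt
# train.en_XX-ro_RO.en_XX.bin   train.en_XX-ro_RO.en_XX.idx
# train.en_XX-ro_RO.ro_RO.bin   train.en_XX-ro_RO.ro_RO.idx
# valid.* and test.* files following the same pattern
```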

## Finetune on EN-RO

Fine-tune the mBART CC25 model on the binarized EN-RO data:

```bash
PRETRAIN=/path/to/model/mbart.cc25
langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

python train.py path_2_data \
  --encoder-normalize-before --decoder-normalize-before --layernorm-embedding \
  --arch mbart_large --task translation_from_pretrained_bart \
  --source-lang en_XX --target-lang ro_RO --langs $langs \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.2 --dataset-impl mmap \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' --weight-decay 0.0 \
  --lr-scheduler polynomial_decay --lr 3e-05 --min-lr -1 --warmup-updates 2500 --total-num-update 40000 \
  --dropout 0.3 --attention-dropout 0.1 --max-tokens 1024 --update-freq 2 \
  --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 10 --no-epoch-checkpoints \
  --seed 222 --log-format simple --log-interval 2 \
  --reset-optimizer --reset-meters --reset-dataloader --reset-lr-scheduler \
  --restore-file $PRETRAIN --ddp-backend no_c10d
```
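
In the command above, the positional argument `path_2_data` stands for the binarized directory produced by the preprocessing step, and `$PRETRAIN` should point at the downloaded pre-trained weights. Concrete values might look like this (both paths are assumptions about the local layout, not part of the original instructions):

```bash
# Hypothetical concrete paths; substitute them for the placeholders above.
path_2_data=${DEST}/${NAME}      # binarized data written by preprocess.py
PRETRAIN=mbart.cc25/model.pt     # pre-trained checkpoint from the extracted tarball (path assumed)
```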
## Generate on EN-RO

Compute sacreBLEU with the fine-tuned EN-RO model. Set up the tokenizer scripts from [wmt16-scripts](https://github.com/rsennrich/wmt16-scripts) and download the fine-tuned checkpoint:

```bash
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart/mbart.cc25.ft.enro.tar.gz
tar -xzvf mbart.cc25.ft.enro.tar.gz
```
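
`$TOKENIZER` in the evaluation commands below is not defined in this README; it is assumed to be any command that reads text on stdin and tokenizes it for the language given as its argument. One hedged way to set it up, wrapping the Moses tokenizer (the wrapper script and the mosesdecoder path are assumptions):

```bash
# Hypothetical wrapper script; adjust the mosesdecoder path to your checkout.
cat > tok.sh << 'EOF'
#!/bin/bash
lang=$1
perl /path/to/mosesdecoder/scripts/tokenizer/tokenizer.perl -l $lang -no-escape
EOF
chmod +x tok.sh
TOKENIZER=./tok.sh
```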

```bash
model=model.pt  # fine-tuned checkpoint from the extracted mbart.cc25.ft.enro.tar.gz
# $langs is the same comma-separated language list used for fine-tuning above.
python generate.py path_2_data \
  --path $model \
  --task translation_from_pretrained_bart \
  --gen-subset test \
  -t ro_RO -s en_XX \
  --bpe 'sentencepiece' --sentencepiece-vocab sentence.bpe.model \
  --sacrebleu --remove-bpe 'sentencepiece' \
  --max-sentences 32 --langs $langs > en_ro

cat en_ro | grep -P "^H" | sort -V | cut -f 3- | sed 's/\[ro_RO\]//g' | $TOKENIZER ro > en_ro.hyp
cat en_ro | grep -P "^T" | sort -V | cut -f 2- | sed 's/\[ro_RO\]//g' | $TOKENIZER ro > en_ro.ref
sacrebleu -tok 'none' -s 'none' en_ro.ref < en_ro.hyp
```
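
Equivalently, the score could be computed with sacreBLEU's Python API instead of the CLI; a sketch, assuming the `en_ro.hyp` and `en_ro.ref` files produced above:

```python
# Sketch: score the generated hypotheses with sacrebleu's Python API.
import sacrebleu

with open("en_ro.hyp") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("en_ro.ref") as f:
    refs = [line.rstrip("\n") for line in f]

# tokenize="none" mirrors the `-tok 'none'` flag above, since the text has
# already been tokenized externally.
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="none")
print(bleu.score)
```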

## Citation

```bibtex
@article{liu2020multilingual,
  title={Multilingual Denoising Pre-training for Neural Machine Translation},
  author={Yinhan Liu and Jiatao Gu and Naman Goyal and Xian Li and Sergey Edunov and Marjan Ghazvininejad and Mike Lewis and Luke Zettlemoyer},
  year={2020},
  eprint={2001.08210},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```