System submitted to the IWSLT 2022 offline speech translation task by the UPC Machine Translation group.
The paper is available at https://aclanthology.org/2022.iwslt-1.23.
This paper describes the submissions of the UPC Machine Translation group to the IWSLT 2022 Offline Speech Translation and Speech-to-Speech Translation tracks. The offline task involves translating English speech to German, Japanese and Chinese text. Our Speech Translation systems are trained end-to-end and are based on large pretrained speech and text models. We use an efficient fine-tuning technique that trains only specific layers of our system, and explore the use of adapter modules for the non-trainable layers. We further investigate the suitability of different speech encoders (wav2vec 2.0, HuBERT) for our models and the impact of knowledge distillation from the Machine Translation model that we use for the decoder (mBART). For segmenting the IWSLT test sets we fine-tune a pretrained audio segmentation model and achieve improvements of 5 BLEU compared to the given segmentation. Our best single model uses HuBERT and parallel adapters and achieves 29.42 BLEU at English-German MuST-C tst-COMMON and 26.77 at IWSLT 2020 test. By ensembling many models, we further increase translation quality to 30.83 BLEU and 27.78 accordingly. Furthermore, our submission for English-Japanese achieves 15.85 and English-Chinese obtains 25.63 BLEU on the MuST-C tst-COMMON sets. Finally, we extend our system to perform English-German Speech-to-Speech Translation with a pretrained Text-to-Speech model.
@inproceedings{tsiamas-etal-2022-pretrained,
title = "Pretrained Speech Encoders and Efficient Fine-tuning Methods for Speech Translation: {UPC} at {IWSLT} 2022",
author = "Tsiamas, Ioannis and
G{\'a}llego, Gerard I. and
Escolano, Carlos and
Fonollosa, Jos{\'e} A. R. and
Costa-juss{\`a}, Marta R.",
booktitle = "Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022)",
month = may,
year = "2022",
address = "Dublin, Ireland (in-person and online)",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.iwslt-1.23",
pages = "265--276",
}
- Environment Setup
- Pretrained Models
- Data
- Knowledge Distillation
- Training
- MuST-C Evaluation
- IWSLT Evaluation
Set the environment variables:
export IWSLT_ROOT=... # where to clone this repo
export FAIRSEQ_ROOT=... # where to clone fairseq
Clone this repository to $IWSLT_ROOT:
git clone --recursive https://github.com/mt-upc/iwslt-2022.git ${IWSLT_ROOT}
Create a conda environment using the environment.yml file, activate it, and install Fairseq:
conda env create -f ${IWSLT_ROOT}/environment.yml && \
conda activate iwslt22 && \
pip install --editable ${IWSLT_ROOT}/fairseq/
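As a quick sanity check, you can confirm that the editable Fairseq install is importable from the activated environment:
python -c "import fairseq; print(fairseq.__version__)"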
Install NVIDIA's apex library for faster training with fp16 precision:
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
--global-option="--deprecated_fused_adam" --global-option="--xentropy" \
--global-option="--fast_multihead_attn" ./
In this project we use pre-trained speech encoders and text decoders.
Download the HuBERT, wav2vec 2.0 and mBART models to $MODELS_ROOT:
export MODELS_ROOT=...
mkdir -p ${MODELS_ROOT}/{wav2vec,hubert,mbart}
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_vox_960h_pl.pt -P ${MODELS_ROOT}/wav2vec
wget https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt -P ${MODELS_ROOT}/hubert
wget https://dl.fbaipublicfiles.com/fairseq/models/mbart50/mbart50.ft.1n.tar.gz -O - | \
tar -xz --strip-components 1 -C ${MODELS_ROOT}/mbart
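Optionally, verify that all three checkpoints are in place (we assume the mBART archive unpacks to a model.pt plus its dictionaries and sentencepiece model):
ls -lh ${MODELS_ROOT}/wav2vec/wav2vec_vox_960h_pl.pt \
${MODELS_ROOT}/hubert/hubert_large_ll60k.pt \
${MODELS_ROOT}/mbart/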
Set the data environment variables:
export MUSTC_ROOT=... # where to download MuST-C v2
export CV_ROOT=...          # where to download the Common Voice corpus v8.0
export EUROPARL_ROOT=... # where to download Europarl-ST
export IWSLT_TST_ROOT=... # where to download the IWSLT test sets
Download MuST-C v2 en-de, en-ja and en-zh to $MUSTC_ROOT:
The dataset is available at https://ict.fbk.eu/must-c/. Press the button "click here to download the corpus" and select version V2.
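After downloading, extract each archive into $MUSTC_ROOT. The archive names below are an assumption based on the usual MuST-C v2 naming; adjust them to match your downloads:
for tgt_lang in {de,ja,zh}; do
tar -xzf MUSTC_v2.0_en-${tgt_lang}.tar.gz -C ${MUSTC_ROOT}
done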
Download Common Voice version 8.0 and the CoVoST tsvs (en-de, en-ja, en-zh) to $CV_ROOT:
export COVOST_ROOT=${CV_ROOT}/en/CoVoST
mkdir -p ${COVOST_ROOT}/{en-de,en-ja,en-zh}
wget https://mozilla-common-voice-datasets.s3.dualstack.us-west-2.amazonaws.com/cv-corpus-8.0-2022-01-19/cv-corpus-8.0-2022-01-19-en.tar.gz -P ${COVOST_ROOT}
wget https://dl.fbaipublicfiles.com/covost/covost_v2.en_de.tsv.tar.gz -P ${COVOST_ROOT}/en-de
wget https://dl.fbaipublicfiles.com/covost/covost_v2.en_zh-CN.tsv.tar.gz -P ${COVOST_ROOT}/en-zh
wget https://dl.fbaipublicfiles.com/covost/covost_v2.en_ja.tsv.tar.gz -P ${COVOST_ROOT}/en-ja
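Extract the archives (a sketch, assuming the archive names from the commands above; --strip-components 1 drops the top-level cv-corpus-* directory so the clips land at ${CV_ROOT}/en/clips):
tar -xzf ${COVOST_ROOT}/cv-corpus-8.0-2022-01-19-en.tar.gz -C ${CV_ROOT} --strip-components 1
for dir in ${COVOST_ROOT}/en-{de,ja,zh}; do
tar -xzf ${dir}/covost_v2.en_*.tsv.tar.gz -C ${dir}
done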
Download Europarl-ST v1.1 to $EUROPARL_ROOT:
mkdir -p ${EUROPARL_ROOT}
wget https://www.mllp.upv.es/europarl-st/v1.1.tar.gz -O - | tar -xz --strip-components 1 -C ${EUROPARL_ROOT}
Download the IWSLT test sets (tst2019, tst2020, tst2021, tst2022):
mkdir -p $IWSLT_TST_ROOT
for year in {2019,2020,2021}; do
wget http://i13pc106.ira.uka.de/~jniehues/IWSLT-SLT/data/eval/en-de/IWSLT-SLT.tst${year}.en-de.tgz
tar -xvf IWSLT-SLT.tst${year}.en-de.tgz -C ${IWSLT_TST_ROOT}
rm IWSLT-SLT.tst${year}.en-de.tgz
done
for tgt_lang in {de,ja,zh}; do
wget http://i13pc106.ira.uka.de/~jniehues/IWSLT-SLT/data/eval/en-${tgt_lang}/IWSLT-SLT.tst2022.en-${tgt_lang}.tgz
tar -xvf IWSLT-SLT.tst2022.en-${tgt_lang}.tgz -C ${IWSLT_TST_ROOT}
rm IWSLT-SLT.tst2022.en-${tgt_lang}.tgz
# get the file order for this pair
cut -d' ' -f1 ${IWSLT_TST_ROOT}/IWSLT.tst2022/IWSLT.TED.tst2022.en-${tgt_lang}.en.video_url > ${IWSLT_TST_ROOT}/IWSLT.tst2022/FILE_ORDER.en-${tgt_lang}
done
Convert the Common Voice clips to 16kHz and mono:
(We only need to convert the ones in the train, dev and test splits)
mkdir -p ${CV_ROOT}/en/clips_mono_16k
for split in {train,dev,test}; do
tail -n +2 ${COVOST_ROOT}/${split}.tsv | cut -f2 | parallel -j $(nproc) ffmpeg -i ${CV_ROOT}/en/clips/{} \
-ac 1 -ar 16000 -hide_banner -loglevel error ${CV_ROOT}/en/clips_mono_16k/{.}.wav
done
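To sanity-check the conversion, compare the number of clips listed in the three splits against the number of converted wavs (this assumes each tsv has a header row with a path column):
cat ${COVOST_ROOT}/{train,dev,test}.tsv | cut -f2 | grep -vc '^path$'   # expected
find ${CV_ROOT}/en/clips_mono_16k -name '*.wav' | wc -l                 # actual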
Prepare the tsvs for the MuST-C, Europarl-ST and CoVoST data:
We do this process for both the ASR and ST tasks and for all language pairs.
We only prepare the tsvs and do not learn a vocabulary since we will reuse the one from mBART50.
# MuST-C (en-de,en-zh,en-ja)
for task in {asr,st}; do
python ${IWSLT_ROOT}/scripts/data_prep/prep_mustc_data.py \
--data-root ${MUSTC_ROOT} --task $task --use-audio-input --only-manifest --append-lang-id
done
# Europarl-ST (en-de)
for task in {asr,st}; do
python ${IWSLT_ROOT}/scripts/data_prep/prep_europarl_data.py \
-d ${EUROPARL_ROOT} --lang-pair en-de --task $task --use-audio-input --only-manifest --append-lang-id
done
# CoVoST (en-de,en-zh,en-ja)
for tgt_lang in {de,zh-CN,ja}; do
for task in {asr,st}; do
python ${IWSLT_ROOT}/scripts/data_prep/prep_covost_data.py \
-d $COVOST_ROOT -s en -t $tgt_lang --append-lang-id
done
done
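Each call writes fairseq-style S2T manifests (tsvs) next to the data. A quick way to eyeball the result:
head -3 ${MUSTC_ROOT}/en-de/train_st.tsv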
Do ASR inference on the "train" sets using a pre-trained wav2vec 2.0 model and save the results at $FILTERING_ROOT:
export FILTERING_ROOT=...
# MuST-C
for tgt_lang in {de,ja,zh}; do
python ${IWSLT_ROOT}/scripts/filtering/asr_inference.py \
--tsv_path ${MUSTC_ROOT}/en-${tgt_lang}/train_asr.tsv -o ${FILTERING_ROOT}/MUSTC_v2.0/en
done
# Europarl-ST
for split in {train,dev,test}; do
python ${IWSLT_ROOT}/scripts/filtering/asr_inference.py \
--tsv_path ${EUROPARL_ROOT}/en/en-de_${split}_asr.tsv -o ${FILTERING_ROOT}/EuroparlST/en
done
# CoVoST
for split in {train,dev,test}; do
for tgt_lang in {de,ja,zh}; do
python ${IWSLT_ROOT}/scripts/filtering/asr_inference.py \
--tsv_path ${COVOST_ROOT}/en-${tgt_lang}/${split}_asr.tsv -o ${FILTERING_ROOT}/CoVoST/en
done
done
Apply ASR-based and text-based filtering to create clean versions of the train sets:
# MuST-C
for tgt_lang in {de,ja,zh}; do
python ${IWSLT_ROOT}/scripts/filtering/filter_tsv.py \
-tsv ${MUSTC_ROOT}/en-${tgt_lang}/train_st.tsv \
-p ${FILTERING_ROOT}/MUSTC_v2.0/en/train_asr_wer_results.json \
-o ${MUSTC_ROOT}/en-${tgt_lang} \
-par -wer 0.75
done
# Europarl-ST
for split in {train,dev,test}; do
python ${IWSLT_ROOT}/scripts/filtering/filter_tsv.py \
-tsv ${EUROPARL_ROOT}/en/en-de_${split}_st.tsv \
-p ${FILTERING_ROOT}/EuroparlST/en/en-de_${split}_asr_wer_results.json \
-o ${EUROPARL_ROOT}/en \
-par -wer 0.75
done
# CoVoST
for tgt_lang in {de,ja,zh}; do
for split in {train,dev,test}; do
python ${IWSLT_ROOT}/scripts/filtering/filter_tsv.py \
-tsv ${COVOST_ROOT}/en-${tgt_lang}/${split}_st.tsv \
-p ${FILTERING_ROOT}/CoVoST/en/${split}_asr_wer_results.json \
-o ${COVOST_ROOT}/en-${tgt_lang} \
-par -wer 0.75
done
done
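To quantify what the filtering removed, compare the line counts of the original and filtered manifests (the _filtered suffix matches the file names used in the symlink step below):
for tgt_lang in {de,ja,zh}; do
wc -l ${MUSTC_ROOT}/en-${tgt_lang}/train_st.tsv ${MUSTC_ROOT}/en-${tgt_lang}/train_st_filtered.tsv
done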
Set up the path:
export DATA_ROOT=...
mkdir -p ${DATA_ROOT}/{en-de,en-zh,en-ja}
Make symbolic links:
# from MuST-C
for tgt_lang in {de,ja,zh}; do
for task in {asr,st}; do
ln -s ${MUSTC_ROOT}/en-${tgt_lang}/train_${task}_filtered.tsv ${DATA_ROOT}/en-${tgt_lang}/train_${task}_mustc.tsv
ln -s ${MUSTC_ROOT}/en-${tgt_lang}/dev_${task}.tsv ${DATA_ROOT}/en-${tgt_lang}/dev_${task}_mustc.tsv
ln -s ${MUSTC_ROOT}/en-${tgt_lang}/tst-COMMON_${task}.tsv ${DATA_ROOT}/en-${tgt_lang}/tst-COMMON_${task}_mustc.tsv
done
done
# from Europarl-ST
for split in {train,dev,test}; do
for task in {asr,st}; do
if [[ $split != train ]]; then
ln -s ${EUROPARL_ROOT}/en/en-de_${split}_${task}_filtered.tsv ${DATA_ROOT}/en-de/train_${split}_${task}_europarl.tsv
else
ln -s ${EUROPARL_ROOT}/en/en-de_${split}_${task}_filtered.tsv ${DATA_ROOT}/en-de/${split}_${task}_europarl.tsv
fi
done
done
# from CoVoST
for tgt_lang in {de,ja,zh}; do
for split in {train,dev,test}; do
for task in {asr,st}; do
if [[ $split != train ]]; then
ln -s ${COVOST_ROOT}/en-${tgt_lang}/${split}_${task}_filtered.tsv ${DATA_ROOT}/en-${tgt_lang}/train_${split}_${task}_covost.tsv
else
ln -s ${COVOST_ROOT}/en-${tgt_lang}/${split}_${task}_filtered.tsv ${DATA_ROOT}/en-${tgt_lang}/${split}_${task}_covost.tsv
fi
done
done
done
We are using knowledge distillation for en-de with mBART50 as the teacher.
Extract the top-k probabilities offline before training and save them at $KD_ROOT:
export KD_ROOT=...
for asr_tsv_file in ${DATA_ROOT}/en-de/train*asr*.tsv; do
st_tsv_file=$(echo $asr_tsv_file | sed "s/_asr_/_st_/g")
kd_subdir=$(basename "$st_tsv_file" .tsv)
python ${IWSLT_ROOT}/knowledge_distillation/extract_topk_logits.py \
-asr $asr_tsv_file -st $st_tsv_file -o ${KD_ROOT}/en-de/${kd_subdir}
done
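Each training tsv gets its own subdirectory of extracted top-k probabilities, so after the loop finishes you should see one directory per train_*_st_* manifest:
ls ${KD_ROOT}/en-de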
Set up the path to save the training outputs:
export SAVE_DIR=...
All our experiments can be found at ${IWSLT_ROOT}/config.
To train an experiment called EXP_NAME, run the following command:
EXP_NAME=... # one of the available experiments
# adjust the update_freq according to the number of available GPUs
base_update_freq=24
n_gpus=$(nvidia-smi --list-gpus | wc -l)
fairseq-hydra-train \
--config-dir ${IWSLT_ROOT}/config/ \
--config-name ${EXP_NAME}.yaml \
dataset.num_workers=$(($(nproc) / 2)) \
optimization.update_freq=[$(( $base_update_freq / $n_gpus ))]
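For example, on a node with 4 GPUs the command above resolves to optimization.update_freq=[6], keeping the effective batch size (n_gpus * update_freq) constant across hardware setups.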
To generate the translations for the MuST-C dev or tst-COMMON sets run the following command:
EXP_NAME=... # one of the trained experiments
CKPT_NAME=... # the name of a .pt file
SUBSET=... # dev_mustc or tst-COMMON_mustc
TGT_LANG=... # de, zh or ja
${IWSLT_ROOT}/scripts/generate.sh $EXP_NAME $CKPT_NAME $SUBSET $TGT_LANG
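For example, to evaluate an en-de model on tst-COMMON (the checkpoint name here is a placeholder; use a config name from ${IWSLT_ROOT}/config and a checkpoint produced by your run):
${IWSLT_ROOT}/scripts/generate.sh ${EXP_NAME} checkpoint_best.pt tst-COMMON_mustc de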
To generate translations for the IWSLT test sets, we first have to segment the audio files.
We are using SHAS. Clone the SHAS repo to $SHAS_ROOT:
export SHAS_ROOT=...
git clone https://github.com/mt-upc/SHAS.git ${SHAS_ROOT}
Create an environment for the segmentation:
conda env create -f ${SHAS_ROOT}/environment.yml
Download the multilingual checkpoint for the Segmentation Frame Classifier to $SHAS_ROOT/mult_sfc_model_epoch-4.pt.
Segment the wav files of the IWSLT test sets with the multilingual classifier and the pDAC algorithm, using a max-segment-length of 16 and inference-times of 3, which we found to be optimal. Save the segmentation yaml at $path_to_custom_segmentation_yaml:
conda activate shas
SUBSET=... # IWSLT.tst2019, IWSLT.tst2020, IWSLT.tst2021 or IWSLT.tst2022
python ${SHAS_ROOT}/src/supervised_hybrid/segment.py \
-wavs ${IWSLT_TST_ROOT}/${SUBSET}/wavs \
-ckpt ${SHAS_ROOT}/mult_sfc_model_epoch-4.pt \
-yaml $path_to_custom_segmentation_yaml \
-max 16 -n 3
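To process all four test sets in one go (a sketch; the yaml path is our choice and any writable location works):
for SUBSET in IWSLT.tst2019 IWSLT.tst2020 IWSLT.tst2021 IWSLT.tst2022; do
python ${SHAS_ROOT}/src/supervised_hybrid/segment.py \
-wavs ${IWSLT_TST_ROOT}/${SUBSET}/wavs \
-ckpt ${SHAS_ROOT}/mult_sfc_model_epoch-4.pt \
-yaml ${IWSLT_TST_ROOT}/${SUBSET}/custom_segmentation.yaml \
-max 16 -n 3
done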
To evaluate translations from a custom segmentation, we use mwerSegmenter to align the hypotheses with the references.
Download mwerSegmenter to ${MWERSEGMENTER_ROOT} and follow the instructions in ${MWERSEGMENTER_ROOT}/README to install it:
export MWERSEGMENTER_ROOT=...
mkdir -p $MWERSEGMENTER_ROOT
wget https://www-i6.informatik.rwth-aachen.de/web/Software/mwerSegmenter.tar.gz
tar -zxvf mwerSegmenter.tar.gz -C ${MWERSEGMENTER_ROOT}
rm mwerSegmenter.tar.gz
We also need a Python 2 environment to run it:
conda create -n snakes27 python=2.7
Generate translations on the created segmentation and calculate the BLEU scores if $SUBSET is IWSLT.tst2019 or IWSLT.tst2020:
${IWSLT_ROOT}/scripts/segmentation/eval.sh \
${SAVE_DIR}/${EXP_NAME}/ckpts/${CKPT_NAME} \
$SUBSET \
$TGT_LANG \
$path_to_custom_segmentation_yaml
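For example, scoring an en-de model on IWSLT.tst2019 with the segmentation created above (the checkpoint name is a placeholder for one produced by your training run):
SUBSET=IWSLT.tst2019
TGT_LANG=de
${IWSLT_ROOT}/scripts/segmentation/eval.sh \
${SAVE_DIR}/${EXP_NAME}/ckpts/checkpoint_best.pt \
$SUBSET \
$TGT_LANG \
${IWSLT_TST_ROOT}/${SUBSET}/custom_segmentation.yaml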