Merge pull request #318 from Maggione/main

mmspeech
OFA-Sys · Dec 13, 2022 · d4fb41b · d4fb41b
2 parents 5fd17e7 + 77b447e

Showing 27 changed files with 56,393 additions and 6 deletions.
diff --git a/README.md b/README.md
@@ -46,6 +46,7 @@ We support the inference of OFA in Huggingface Transformers. Check the [README](
 
 
 # News
+* 2022.12.7: Released MMSpeech, an ASR pre-training method based on OFA. Check our paper [here](https://arxiv.org/abs/2212.00500)! Please see [README_mmspeech.md](README_mmspeech.md) for further details.
 * 2022.8.16: Released the **Chinese** version of OFA. **OFA-CN** needs only switching to `bpe_dir=../../utils/BERT_CN_dict` and `bpe=bert` and using our provided Chinese checkpoints in [checkpoints_cn.md](checkpoints_cn.md). Temporarily, we only provide base-size and large-size pretrained checkpoints and finetuned checkpoints on [MUGE Caption](https://tianchi.aliyun.com/muge) and the Chinese version of RefCOCO(-/+/g) (to release soon). 
 * 2022.8.5: Released support of **prompt tuning** for OFA. Check our paper [here](https://arxiv.org/abs/2208.02532)! Please see the [prompt_tuning.md](prompt_tuning.md) for further details.
 * 2022.7.7: Updated support of OFA on **huggingface transformers** (fixed bugs in forward, add sequence generator from Fairseq to ensure performance, etc.). Refer to the doc [transformers.md](transformers.md) and the branch `feature/add_transformers`. 

diff --git a/README_mmspeech.md b/README_mmspeech.md
@@ -0,0 +1,78 @@
+# MMSpeech: Multi-modal Multi-task Encoder-Decoder Pre-training for Speech Recognition
+
+<p align="center">
+        <a href="modelscope.md">ModelScope</a>&nbsp; ｜ &nbsp;<a href="https://arxiv.org/abs/2212.00500">Paper</a>&nbsp;
+</p>
+
+We propose MMSpeech, a novel multi-modal multi-task encoder-decoder pre-training framework for Mandarin automatic speech recognition (ASR). It combines five self-supervised and supervised tasks over speech and text data.
+Experiments on AISHELL-1 show that our proposed method achieves state-of-the-art performance, with a more than 40% relative improvement compared with other pre-training methods.
+
+<p align="center">
+    <br>
+    <img src="examples/mmspeech.png" width="700" />
+    <br>
+</p>
+<br>
+
+## Datasets & Checkpoints
+| Model          | Model Size |                  Unlabeled Speech                  |                Unlabeled Text                 |               Labeled Data               |                                                      Pre-Training                                                       |                                                       Fine-Tuning                                                       |
+|:---------------|:----------:|:--------------------------------------------------:|:---------------------------------------------:|:----------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------:|:-----------------------------------------------------------------------------------------------------------------------:|
+| MMSpeech-Base1 |    210M    | [AISHELL-2](https://www.aishelltech.com/aishell_2) | [M6-Corpus](https://arxiv.org/abs/2103.00823) | [AISHELL-1](http://www.openslr.org/33/)  | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base1_aishell1.pt) |
+| MMSpeech-Base2 |    210M    | [WenetSpeech](https://wenet.org.cn/WenetSpeech/)   |                   M6-Corpus                   |                AISHELL-1                 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_base2_aishell1.pt) |
+| MMSpeech-Large |    609M    |                    WenetSpeech                     |                   M6-Corpus                   |                AISHELL-1                 | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_pretrain.pt) | [checkpoint](https://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/ofa_mmspeech_large_aishell1.pt) |
+
+## Results on AISHELL-1 (CER, %)
+- Comparison of MMSpeech-Base1 with models of the same encoder size trained on the same amount of unlabeled speech data.
+
+| Model                            | dev (w/o LM) | dev (with LM) | test (w/o LM) | test (with LM) |
+|:---------------------------------|:------------:|:-------------:|:-------------:|:--------------:|
+| w/o pre-training                 |     6.4      |      5.2      |      6.8      |      5.7       |
+| Data2Vec                         |     3.8      |      3.7      |      4.1      |      3.9       |
+| MMSpeech-Base1                   |     2.4      |      2.1      |      2.6      |      2.3       |
+| MMSpeech-Base1 (w/o Fine-Tuning) |     2.5      |      2.3      |      2.6      |      2.3       |
+
+- Comparison of MMSpeech-Base2 with models of the same encoder size trained on the same amount of unlabeled speech data.
+
+| Model            | dev (with LM) | test (with LM) |
+|:-----------------|:-------------:|:--------------:|
+| Wav2vec 2.0-Base |      4.2      |      4.7       |
+| HuBERT-Base      |      4.1      |      4.3       |
+| MMSpeech-Base2   |      2.0      |      2.1       |
+
+- Comparison of MMSpeech-Large with models of the same encoder size trained on the same amount of unlabeled speech data.
+
+| Model             | dev (with LM) | test (with LM) |
+|:------------------|:-------------:|:--------------:|
+| Wav2vec 2.0-Large |      3.8      |      4.1       |
+| HuBERT-Large      |      3.1      |      3.3       |
+| MMSpeech-Large    |      1.6      |      1.9       |
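
As a rough sanity check of the "more than 40% relative improvement" statement, the sketch below recomputes the relative error reduction from the test (with LM) numbers in the three tables above. It assumes the table values are error rates in percent and that the comparison is against the strongest pre-trained baseline in each table. The helper name `relative_improvement` is illustrative and not part of the repository.

```python
def relative_improvement(baseline: float, ours: float) -> float:
    """Relative error reduction in percent: (baseline - ours) / baseline * 100."""
    return (baseline - ours) / baseline * 100.0


# (baseline, MMSpeech) error rates on the test set with LM, copied from the tables above.
pairs = {
    "Base1 vs Data2Vec": (3.9, 2.3),
    "Base2 vs HuBERT-Base": (4.3, 2.1),
    "Large vs HuBERT-Large": (3.3, 1.9),
}

for name, (baseline, ours) in pairs.items():
    print(f"{name}: {relative_improvement(baseline, ours):.1f}% relative improvement")
```

Under these assumptions, all three comparisons come out above 40%, which is consistent with the claim in the introduction.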
+
+
+## Quick start
+### Data preparation
+
+Input files for all tasks contain three tab-delimited ("\t") columns: speech_id, wav_path, and text.
+- "wav_path" denotes the path to the wav file.
+- "text" denotes the raw text input.
+- "pseudo-codes" fill the text column for unlabeled speech data and can be obtained by following the steps in [wav2seq](https://github.com/asappresearch/wav2seq).
+
+| Data                  |   Task   | speech_id_col | wav_path_col |   text_col   |
+|:----------------------|:--------:|:-------------:|:------------:|:------------:|
+| unlabeled speech data | S2C, MSP |   speech_id   |   wav_path   | pseudo-codes |
+| unlabeled text data   |   P2T    |   speech_id   |    unused    |     text     |
+| speech-text data      |   S2T    |   speech_id   |   wav_path   |     text     |
+
+We also provide an example `config_yaml` for input fbank features [here](http://ofadatain.oss-cn-hangzhou.aliyuncs.com/mmspeech_open_source/github/data/fbank_config.yaml) for your reference.
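
The three-column layout described above could be produced with a small script like the following. This is a minimal sketch, not code from the repository: the function name `write_manifest`, the example IDs, paths, and transcripts are all hypothetical, and it assumes UTF-8 files with no header row, which the README does not specify.

```python
import tempfile
from pathlib import Path


def write_manifest(rows, out_path):
    """Write (speech_id, wav_path, text) triples, one tab-delimited line each."""
    with open(out_path, "w", encoding="utf-8") as f:
        for row in rows:
            if len(row) != 3:
                raise ValueError(f"expected 3 columns, got {len(row)}: {row!r}")
            # A tab or newline inside a field would break the column layout.
            if any("\t" in col or "\n" in col for col in row):
                raise ValueError(f"tab/newline inside a column: {row!r}")
            f.write("\t".join(row) + "\n")


# Hypothetical speech-text (S2T) rows; every value here is made up for illustration.
s2t_rows = [
    ("utt_0001", "/data/wav/utt_0001.wav", "今天天气很好"),
    ("utt_0002", "/data/wav/utt_0002.wav", "欢迎使用语音识别"),
]

out_file = Path(tempfile.mkdtemp()) / "train_s2t.tsv"
write_manifest(s2t_rows, out_file)
print(out_file.read_text(encoding="utf-8"))
```

For unlabeled speech data, the third field would hold the pseudo-codes instead of a transcript. For unlabeled text data, the second field is not read, so any placeholder value should do, though that is an assumption worth checking against the data loader.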
+
+### Training
+```commandline
+cd run_scripts/mmspeech
+sh mmspeech_cn_base_stage1.sh
+sh mmspeech_cn_base_stage2.sh
+sh mmspeech_cn_base_stage3.sh
+```
+### Evaluation
+```commandline
+cd run_scripts/mmspeech
+sh evaluate_mmspeech_base.sh
+```
diff --git a/criterions/__init__.py b/criterions/__init__.py
@@ -2,3 +2,4 @@
 from .label_smoothed_cross_entropy import AdjustLabelSmoothedCrossEntropyCriterion
 from .clip_scst_loss import ClipScstRewardCriterion
 from .label_smoothed_encouraging_loss import AdjustLabelSmoothedEncouragingLossCriterion
+from .speech_pretrain_loss import SpeechPretrainLoss