We provide code to reproduce the main results of *Jointly Learning Visual and Auditory Speech Representations from Raw Data* and *BRAVEn: Improving Self-Supervised Pre-training for Visual and Auditory Speech Recognition*. Our implementation is based on PyTorch Lightning.
To create the conda environment, run:

conda env create -f environment.yml

Change the environment `prefix` in `environment.yml` to match the location of your miniconda3 installation, if necessary.
- The datasets used in the paper can be downloaded from the following links:
- Compute 68 facial landmarks per frame using, e.g., RetinaFace and 2-D FAN, or download pre-computed landmarks, e.g., from this repo. Each landmark file should have the same name as its corresponding video (but with a .npy extension).
- Use the following command to crop the mouths:
python preprocessing/extract_mouths.py --src_dir ${SOURCE_DIR} --tgt_dir ${TARGET_DIR} --landmarks_dir ${LANDMARKS_DIR}
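
For reference, the cropping step roughly amounts to extracting a fixed-size region centred on the mouth landmarks of every frame. The following is a minimal illustrative sketch, not the actual logic of `preprocessing/extract_mouths.py`; the crop size and the absence of landmark smoothing are assumptions and may differ from the script's defaults.

```python
# Illustrative sketch of mouth cropping from 68-point landmarks.
# Crop size and lack of temporal smoothing are assumptions; the real
# preprocessing/extract_mouths.py may differ.
import cv2
import numpy as np

CROP_SIZE = 96  # hypothetical output resolution


def crop_mouth(frame: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    """Crop a square region centred on the mouth from one video frame.

    frame:     H x W x 3 image.
    landmarks: 68 x 2 array of (x, y) facial landmarks for this frame.
    """
    mouth = landmarks[48:68]                 # indices 48-67 are the mouth points
    cx, cy = mouth.mean(axis=0).astype(int)  # centre of the mouth region
    half = CROP_SIZE // 2
    h, w = frame.shape[:2]
    # Clamp the crop window to the frame boundaries.
    x0, x1 = max(cx - half, 0), min(cx + half, w)
    y0, y1 = max(cy - half, 0), min(cy + half, h)
    crop = frame[y0:y1, x0:x1]
    return cv2.resize(crop, (CROP_SIZE, CROP_SIZE))
```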
Below are the checkpoints of the Base and Large models pre-trained with RAVEn on LRS3+Vox2-en.
Model | Modality | Checkpoint |
---|---|---|
Base | Video | Download |
Base | Audio | Download |
Large | Video | Download |
Large | Audio | Download |
Below are the checkpoints of the Base, Base+, and Large models pre-trained with BRAVEn.
Model | Modality | Checkpoint |
---|---|---|
Base (LRS3) | Video | Download |
Base (LRS3) | Audio | Download |
Base+ (LRS3+Vox2) | Video | Download |
Base+ (LRS3+Vox2) | Audio | Download |
Large (LRS3+Vox2+AVS) | Video | Download |
Large (LRS3+Vox2+AVS) | Audio | Download |
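
The released checkpoints can typically be inspected as regular PyTorch files before plugging them into the model code. Below is a minimal sketch for peeking at a downloaded checkpoint; the filename and the `state_dict` layout (e.g., weights nested under a `"state_dict"` key, as is common for PyTorch Lightning) are assumptions, not guarantees about these particular files.

```python
# Minimal sketch for inspecting a downloaded checkpoint with PyTorch.
# The filename and state_dict layout are assumptions; adapt them to the
# checkpoint you downloaded and to the model definitions in this repo.
import torch

ckpt_path = "raven_base_video.pth"  # hypothetical filename
checkpoint = torch.load(ckpt_path, map_location="cpu")

# PyTorch Lightning checkpoints usually nest the weights under "state_dict";
# otherwise treat the loaded object as a plain state dict.
if isinstance(checkpoint, dict) and "state_dict" in checkpoint:
    state_dict = checkpoint["state_dict"]
else:
    state_dict = checkpoint

# Print a few parameter names and shapes to verify the download.
for name, tensor in list(state_dict.items())[:10]:
    print(name, tuple(tensor.shape))
```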
- Below are the checkpoints corresponding to Tables 1 and 2 for VSR and ASR on LRS3. Models are provided for both the low- and high-resource labelled data settings. In the high-resource setting, the models are fine-tuned on the full LRS3 dataset (433 hours); in the low-resource setting, they are fine-tuned on a subset ("trainval") of LRS3 (30 hours).
- In some cases, the models were re-trained, so the WERs may differ slightly from those reported in the paper (which are also reproduced below).
- The paths of the slurm bash scripts used for inference are shown in the tables below. Note that the scripts may need to be modified to match your cluster environment.
- The language model we used in this work can be found here.

**VSR, low-resource setting (RAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 47.0 | Download | scripts/vsr/lrs3_trainval/base_lrs3.sh |
Base | LRS3+Vox2-en | 40.2 | Download | scripts/vsr/lrs3_trainval/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 32.5 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 23.8 | same as previous row | scripts/vsr/lrs3_trainval/large_lrs3vox2_self_lm.sh |

**VSR, low-resource setting (BRAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 43.4 | Download | scripts/vsr/lrs3_trainval/base_lrs3_braven.sh |
Base+ | LRS3+Vox2-en | 35.1 | Download | scripts/vsr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 30.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 24.8 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 21.3 | Download | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.0 | same as previous row | scripts/vsr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |

**VSR, high-resource setting (RAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 39.1 | Download | scripts/vsr/lrs3/base_lrs3.sh |
Base | LRS3+Vox2-en | 33.1 | Download | scripts/vsr/lrs3/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 27.8 | Download | scripts/vsr/lrs3/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 24.4 | Download | scripts/vsr/lrs3/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 23.1 | same as previous row | scripts/vsr/lrs3/large_lrs3vox2_self_lm.sh |

**VSR, high-resource setting (BRAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 36.0 | Download | scripts/vsr/lrs3/base_lrs3_braven.sh |
Base+ | LRS3+Vox2-en | 28.8 | Download | scripts/vsr/lrs3/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 26.6 | Download | scripts/vsr/lrs3/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 23.6 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 20.9 | Download | scripts/vsr/lrs3/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 20.1 | same as previous row | scripts/vsr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |

**ASR, low-resource setting (RAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 4.7 | Download | scripts/asr/lrs3_trainval/base_lrs3.sh |
Base | LRS3+Vox2-en | 3.8 | Download | scripts/asr/lrs3_trainval/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 2.7 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 1.9 | same as previous row | scripts/asr/lrs3_trainval/large_lrs3vox2_self_lm.sh |

**ASR, low-resource setting (BRAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 4.0 | Download | scripts/asr/lrs3_trainval/base_lrs3_braven.sh |
Base+ | LRS3+Vox2-en | 3.0 | Download | scripts/asr/lrs3_trainval/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 2.3 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 2.1 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 1.9 | Download | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.7 | same as previous row | scripts/asr/lrs3_trainval/large_lrs3vox2avs_self_lm_braven.sh |

**ASR, high-resource setting (RAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 2.2 | Download | scripts/asr/lrs3/base_lrs3.sh |
Base | LRS3+Vox2-en | 1.9 | Download | scripts/asr/lrs3/base_lrs3vox2.sh |
Large | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2.sh |
Large w/ ST | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/large_lrs3vox2_self.sh |
Large w/ ST + LM | LRS3+Vox2-en | 1.4 | same as previous row | scripts/asr/lrs3/large_lrs3vox2_self_lm.sh |

**ASR, high-resource setting (BRAVEn):**

Model | Pre-training dataset | WER (%) | Checkpoint | Bash script |
---|---|---|---|---|
Base | LRS3 | 1.9 | Download | scripts/asr/lrs3/base_lrs3_braven.sh |
Base+ | LRS3+Vox2-en | 1.4 | Download | scripts/asr/lrs3/baseplus_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2_braven.sh |
Large | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_braven.sh |
Large w/ ST | LRS3+Vox2-en+AVS | 1.2 | Download | scripts/asr/lrs3/large_lrs3vox2avs_self_braven.sh |
Large w/ ST + LM | LRS3+Vox2-en+AVS | 1.1 | same as previous row | scripts/asr/lrs3/large_lrs3vox2avs_self_lm_braven.sh |
Code for pre-training and fine-tuning coming soon...
If you find this repo useful for your research, please consider citing the following:
@article{haliassos2022jointly,
title={Jointly Learning Visual and Auditory Speech Representations from Raw Data},
author={Haliassos, Alexandros and Ma, Pingchuan and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
journal={arXiv preprint arXiv:2212.06246},
year={2022}
}
@inproceedings{haliassos2024braven,
title={BRAVEn: Improving Self-supervised pre-training for Visual and Auditory Speech Recognition},
author={Haliassos, Alexandros and Zinonos, Andreas and Mira, Rodrigo and Petridis, Stavros and Pantic, Maja},
booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={11431--11435},
year={2024},
organization={IEEE}
}