This page contains a curated list of papers and resources for Speech Translation, with a focus on end-to-end systems. This list should be considered a starting point for anyone interested in Speech Translation, not a definitive guide.
### Overview

Speech Translation and the End-to-End Promise: Taking Stock of Where We Are; ACL 2020; Paper
Multilingual Speech Translation with Efficient Finetuning of Pretrained Models; ACL 2020; Paper
AdaST: Dynamically Adapting Encoder States in the Decoder for End-to-End Speech-to-Text Translation; ACL 2021; Paper
Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?; ACL 2021; Paper
Gender in Danger? Evaluating Speech Translation Technology on the MuST-SHE Corpus; ACL 2020; Paper
Highland Puebla Nahuatl–Spanish Speech Translation Corpus for Endangered Language Documentation; ACL 2021; Paper
Self-Training for End-to-End Speech Translation; Interspeech 2020; Paper
Towards Unsupervised Speech-to-text Translation; ICASSP 2019; Paper
Fluent Translations from Disfluent Speech in End-to-End Speech Translation; NAACL 2019; Paper
Bridging the Gap between Pre-Training and Fine-Tuning for End-to-End Speech Translation; AAAI 2020; Paper
Fused Acoustic and Text Encoding for Multimodal Bilingual Pretraining and Speech Translation; ICML 2021; Paper
Direct speech-to-speech translation with a sequence-to-sequence model (Translatotron 1); Interspeech 2019; Paper; Google AI blog post; Google Research Audio Samples
Translatotron 2: Robust direct speech-to-speech translation; arXiv 2021; Paper; Google Research Audio Samples
Assessing Evaluation Metrics for Speech-to-Speech Translation; ASRU 2021; Paper
Transformer-based Direct Speech-to-speech Translation with Transcoder; SLT 2021; Paper
Direct speech-to-speech translation with discrete units; arXiv 2021; Paper
Direct simultaneous speech to speech translation; arXiv 2021; Paper
Speech-to-speech Translation between Untranscribed Unknown Languages; ASRU 2019; Paper
### CoVoST

All data at Facebook Research Github Repo
CoVoST 2: 21 X->En, 15 En->X speech-to-text language pairs; 2880 hours; Paper; MetaAI Announcement; HuggingFace Dataset
CoVoST 1: 11 X->En speech-to-text language pairs; 700 hours; Paper
MTedX: 11 languages into some of En, Es, Fr, It, Pt; 765 hours; Paper; Dataset
Europarl-ST: A Multilingual Corpus For Speech Translation Of Parliamentary Debates; with v1.1 all pairs of 9 European Languages = 72 directions; 1642 hours; Paper; Dataset
MaSS (Multilingual corpus of Sentence-aligned Spoken utterances): 8,130 parallel spoken utterances across 8 languages (56 directions); 172 hours; also provides aligned text; Paper; Dataset
BSTC (Baidu Speech Translation Corpus): 50 hours Zh->En; Paper; Baidu page
Fisher and Callhome Spanish-English Speech Translation: 160 hours Es->En; Paper; Dataset
Most speech-to-speech datasets, however, are produced by synthesizing the target side of speech-to-text translation datasets such as Fisher, Conversational, or CoVoST 2 (as was done for Translatotron).
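The derivation described above can be sketched as follows. This is a minimal illustration, not code from any cited paper: `synthesize` is a placeholder for a real TTS system (the actual pipelines use trained TTS models), and the dataclasses are hypothetical names chosen for the sketch.

```python
# Sketch: derive a pseudo speech-to-speech corpus from a speech-to-text
# translation corpus by pairing each source utterance's audio with
# synthesized audio of its target-language translation text.

from dataclasses import dataclass
from typing import List

@dataclass
class S2TExample:
    """A speech-to-text translation example (hypothetical structure)."""
    source_audio: List[float]   # source-language waveform samples
    translation: str            # target-language text

@dataclass
class S2SExample:
    """A derived speech-to-speech example (hypothetical structure)."""
    source_audio: List[float]
    target_audio: List[float]

def synthesize(text: str) -> List[float]:
    """Placeholder TTS: a real pipeline would run a trained TTS model here.
    Returns one dummy sample per character, just to keep the sketch runnable."""
    return [0.0] * len(text)

def derive_s2s(corpus: List[S2TExample]) -> List[S2SExample]:
    """Pair each source utterance with synthesized target-language speech."""
    return [S2SExample(ex.source_audio, synthesize(ex.translation))
            for ex in corpus]

s2t = [S2TExample(source_audio=[0.1, 0.2, 0.3], translation="hello world")]
s2s = derive_s2s(s2t)
```

The source audio is kept as-is; only the target side is synthetic, which is one reason evaluation of such derived corpora is an open question (see the ASRU 2021 metrics paper above).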
See End-to-End Speech Translation Progress for more papers and datasets by Changhan Wang.
See Awesome speech translation for a very comprehensive set of papers (including Pipeline ST, streaming ST, and other ST problems) compiled by the Chinese Academy of Sciences & ByteDance AI Lab.
See ST Tutorial for a great introduction to Speech Translation with slides and resources, presented at EACL 2021.