Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions

In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Authors: Lingwei Meng, Shujie Hu, Jiawen Kang, Zhaoqing Li, Yuejiao Wang, Wenxuan Wu, Xixin Wu, Xunying Liu, Helen Meng

Paper Link | Citation

This repository contains the implementation of the MT-LLM model for instruction-based multi-talker overlapped speech recognition.

Setup

cd MT-LLM/
git submodule update --init fairseq
conda create -n mtllm python=3.10.16; conda activate mtllm
pip install --editable fairseq/
pip install sentencepiece
pip install transformers==4.32.1
pip install numpy==1.23.5
pip install editdistance
pip install soundfile
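
After installation, it can help to confirm that the two pinned packages resolved to the versions listed above. The following is a minimal sketch, not part of the repository: the helper name `find_mismatches` and the `PINNED` table are assumptions made for illustration, and only `transformers` and `numpy` are checked because they are the only packages the Setup commands pin.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the Setup commands; unpinned packages are not checked.
PINNED = {"transformers": "4.32.1", "numpy": "1.23.5"}


def find_mismatches(pinned):
    """Return {package: (expected, found)} for each package whose installed
    version differs from the pin. `found` is None if the package is missing."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    # Prints one line per mismatch; prints nothing when all pins match.
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

Run it inside the activated `mtllm` environment; no output means both pins match.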

Download the Trained Model

The trained MT-LLM model weights can be downloaded from here.

Inference

cp -r mtllm fairseq/examples
cd fairseq
# Set model_path to the location of the downloaded MT-LLM weights before running.
data_name=for_demo
bash examples/mtllm/scripts/inference.sh "$model_path" "$data_name"

We provide several examples in test_data.

Demos

Task	Audio	#Speakers	Instruction	Output
Multi-Talker ASR	`audio`	2	Transcribe the given audio into text. If multiple speakers are speaking, transcribe the utterances of multiple speakers in the order of their start times, separated by "<sc>".	two monsters only were creating all this commotion and before my eyes are two reptiles of the primitive world <sc> to relieve her from both he laid his hand with force upon his heart and said do you relieve me
Sex-Specific ASR	`audio`	2	Please transcribe the contents spoken by female speakers in overlapping speech.	to relieve her from that he laid his hand with force upon his heart and said do you believe me
Keyword-Tracing ASR	`audio`	2	Please transcribe the speech of the speaker who said the word "reptiles" in the overlapping speech audio.	two monsters only were creating all this commotion and before my eyes are two reptiles of the primitive world
Order-Specific ASR	`audio`	2	There are multiple speakers in the audio. Please transcribe the speech of the first speaker into text.	two monsters only were creating all this commotion and before my eyes are two reptiles of the primitive world
Target-Lingual ASR	`audio`	2	Please transcribe the person speaking German from the overlapping speech audio.	Mit dem Aufkommen des Christentums verloren die römischen Circusse an Bedeutung

Multi-Talker ASR	`audio`	3	Transcribe the given audio into text. If multiple speakers are speaking, transcribe the utterances of multiple speakers in the order of their start times, separated by "<sc>".	well mother said the young student looking up with a shade of impatience <sc> otherwise paul should have written grace from god the father and peace from our lord jesus christ <sc> consumption becomes a larger element in the standard of living in the city than in the country
Sex-Specific ASR	`audio`	3	Please transcribe the contents spoken by female speakers in overlapping speech.	well mother said the young student looking up with a shade of impatience <sc> consumption becomes a larger element in the standard of living in the city than in the country
Keyword-Tracing ASR	`audio`	3	Please transcribe the speech of the speaker who said the word "impatience" in the overlapping speech audio.	well mother said the young student looking up with a shade of impatience
Order-Specific ASR	`audio`	3	There are multiple speakers in the audio. Please transcribe the speech of the second speaker into text.	otherwise paul should have written grace from god the father and peace from our lord jesus christ
Target-Lingual ASR	`audio`	3	Please transcribe the person speaking English from the overlapping speech audio.	for three years he conducted vigorous campaigns in the western land where he met with vigorous resistance <sc> before the stragglers on the administration of law could be brought before the court of last resort and there met with the reversal and rebuke it deserved men were imprisoned under sentence of many years duration
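
In the demos above, outputs that cover several speakers are joined with the "<sc>" separator named in the Multi-Talker ASR instruction. As a rough sketch of how such an output string might be post-processed, the snippet below splits it into per-speaker utterances and picks out the utterance containing a keyword. The function names are hypothetical and not part of the repository; this operates on output text only and says nothing about how the model itself resolves the Keyword-Tracing task.

```python
SPEAKER_CHANGE = "<sc>"  # separator named in the Multi-Talker ASR instruction


def split_utterances(output):
    """Split one output string into a list of per-speaker utterances."""
    return [part.strip() for part in output.split(SPEAKER_CHANGE) if part.strip()]


def utterances_with_keyword(output, keyword):
    """Return the utterances that contain `keyword` as a whole word."""
    return [u for u in split_utterances(output) if keyword in u.split()]


# Output of the 3-speaker Multi-Talker ASR demo row above.
demo_output = (
    "well mother said the young student looking up with a shade of impatience"
    " <sc> otherwise paul should have written grace from god the father"
    " and peace from our lord jesus christ"
    " <sc> consumption becomes a larger element in the standard of living"
    " in the city than in the country"
)

utterances = split_utterances(demo_output)
matches = utterances_with_keyword(demo_output, "impatience")
```

Here `utterances` holds three strings in start-time order, and `matches` holds the single utterance that the Keyword-Tracing demo row returns for the word "impatience".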

Related works

Empowering Whisper for multi-talker and target-talker ASR
Sidecar: Convert a single-talker ASR system to a multi-talker one
Unified modeling of multi-talker speech recognition and diarization
SA-CTC: A speaker-aware CTC for multi-talker overlapped speech recognition
CSE-NET: A SOTA network architecture for multi-talker speech recognition

Citation

If you find our work useful in your research, please cite the following papers:

@inproceedings{meng2025mtllm,
    title={Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions},
    author={Meng, Lingwei and Hu, Shujie and Kang, Jiawen and Li, Zhaoqing and Wang, Yuejiao and Wu, Wenxuan and Wu, Xixin and Liu, Xunying and Meng, Helen},
    booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year={2025}
}

@article{hu2024wavllm,
    title={WavLLM: Towards Robust and Adaptive Speech Large Language Model},
    author={Shujie Hu and Long Zhou and Shujie Liu and Sanyuan Chen and Lingwei Meng and Hongkun Hao and Jing Pan and Xunying Liu and Jinyu Li and Sunit Sivasankaran and Linquan Liu and Furu Wei},
    year={2024},
    eprint={2404.00656},
    archivePrefix={arXiv},
}

Acknowledgements

Much of our code is adapted from WavLLM.

Portions of the source code are based on FAIRSEQ and AV_HuBERT.
