MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols
📄 Paper (arXiv) · 🌐 Project Page · 🤗 Dataset · 💻 Code
MTalk-Bench is a benchmark designed to evaluate speech-to-speech (S2S) large language models (LLMs) in realistic, multi-turn dialogue scenarios. It offers both arena-style and rubric-based evaluation protocols to comprehensively assess models across a diverse range of linguistic, paralinguistic, and acoustic dimensions.
The benchmark covers the following core evaluation aspects:
- Semantic Information: Understanding & Memory, Reasoning & Execution, Interaction Strategy, Security Assessment, Pragmatic & Culture.
- Paralinguistic Information: Paralinguistic Comprehension, Paralinguistic Generation.
- Ambient Sound: Ambient Sound Perception, Multiparty Interaction.
MTalk-Bench is designed to reflect real-world conversational challenges and support fair, transparent, and extensible evaluation of next-generation S2S models.
MTalk-Bench/
├── data/
├── src/ # Source code for automated Audio LLM evaluation
│ ├── audio_arena_style.py
│ └── audio_rubric_based.py
└── asset/
The MTalk-Bench dataset (including audio files, transcribed texts, and testing prompts) is available on 🤗 MTalk-Bench under a research license.
Follow the steps below to get started with MTalk-Bench evaluation.
git clone https://github.com/FreedomIntelligence/MTalk-Bench.git
cd MTalk-Bench
huggingface-cli download \
  --repo-type dataset \
  --resume-download \
  FreedomIntelligence/MTalk-Bench \
  --local-dir MTalk-Bench-Data
This will download the complete dataset (audio, transcripts, prompts) into MTalk-Bench-Data/.
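If you prefer to download from Python, the huggingface_hub library offers an equivalent route. The sketch below assumes huggingface_hub is installed and uses the same repository ID and local directory as the CLI command above.

```python
# Minimal sketch: download the MTalk-Bench dataset via huggingface_hub
# (equivalent to the huggingface-cli command above).
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="FreedomIntelligence/MTalk-Bench",
    repo_type="dataset",
    local_dir="MTalk-Bench-Data",
)
print(f"Dataset downloaded to {local_path}")
```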
Run your chosen speech-to-speech (S2S) model on the MTalk-Bench dataset and generate audio responses.
Format your results as a .json file following the required schema (see the example in ./data/sample.json) and place it in the data/ directory.
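As a rough illustration of this step, the sketch below runs a placeholder S2S inference function over the downloaded audio and writes the collected results to a JSON file. The function run_s2s_model, the output filename, and the field names are hypothetical; the authoritative schema is the one shown in ./data/sample.json.

```python
# Hypothetical sketch of collecting S2S model outputs into a results file.
# Replace run_s2s_model and the field names with your model call and the
# schema shown in ./data/sample.json.
import json
from pathlib import Path

def run_s2s_model(audio_path: str) -> str:
    """Placeholder: run your S2S model and return the path to its audio response."""
    raise NotImplementedError("Call your speech-to-speech model here.")

results = []
for audio_file in sorted(Path("MTalk-Bench-Data").rglob("*.wav")):
    response_path = run_s2s_model(str(audio_file))
    results.append({
        "input_audio": str(audio_file),    # hypothetical field name
        "response_audio": response_path,   # hypothetical field name
    })

with open("data/my_results.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```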
You can choose between arena-style and rubric-based evaluation, and select the type of information to evaluate (semantic, paralinguistic, or ambient).
# Arena-style example:
python ./src/audio_arena_style.py \
--eval_type semantic \
--judge_model gpt-4o-audio-preview \
--new_data_file ./data/sample.json
# Rubric-based example:
python ./src/audio_rubric_based.py \
--eval_type paralinguistic \
--judge_model gemini-2.5-pro \
--new_data_file ./data/sample.json
Available parameters:
- eval_type: semantic, paralinguistic, or ambient
- judge_model: gpt-4o-audio-preview or gemini-2.5-pro
- new_data_file: path to your .json result file
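To sweep all three eval_type values in one run, a small driver script can shell out to the evaluation entry point. The sketch below assumes the arena-style script and flags shown above, and that your judge-model API credentials are already configured in the environment.

```python
# Minimal sketch: run the arena-style evaluation once per eval_type.
# Assumes ./src/audio_arena_style.py and the flags documented above.
import subprocess

for eval_type in ("semantic", "paralinguistic", "ambient"):
    subprocess.run(
        [
            "python", "./src/audio_arena_style.py",
            "--eval_type", eval_type,
            "--judge_model", "gpt-4o-audio-preview",
            "--new_data_file", "./data/sample.json",
        ],
        check=True,  # stop if any evaluation run fails
    )
```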
If you use MTalk-Bench in your research, please cite:
@misc{du2025mtalkbenchevaluatingspeechtospeechmodels,
title={MTalk-Bench: Evaluating Speech-to-Speech Models in Multi-Turn Dialogues via Arena-style and Rubrics Protocols},
author={Yuhao Du and Qianwei Huang and Guo Zhu and Zhanchen Dai and Shunian Chen and Qiming Zhu and Yuhao Zhang and Li Zhou and Benyou Wang and Haizhou Li},
year={2025},
eprint={2508.18240},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.18240},
}