VStyle is a bilingual (Chinese & English) benchmark for voice style adaptation. It covers four key tasks:
- Acoustic attribute control
- Natural language instruction following
- Role-playing
- Implicit empathy
To enable automated and reproducible evaluation, we introduce the LALM-as-a-Judge framework, which assesses model outputs across three dimensions (a minimal judging sketch follows the list):
- Textual faithfulness (Is it saying the right thing?)
- Style adherence (Does it match the intended style?)
- Naturalness (Does it sound smooth and natural?)
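
For intuition, the judging step amounts to prompting an audio-capable model with the instruction, the spoken response, and a rubric, then parsing per-dimension scores. Below is a minimal sketch using the `google-generativeai` package (already in the install line further down); the prompt wording, judge model name, and JSON output format are illustrative assumptions, not the exact logic in `lalm_eval/gemini_eval.py`:

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
judge = genai.GenerativeModel("gemini-1.5-pro")  # assumed judge model

RUBRIC = (
    "Rate the spoken response from 1 to 5 on each dimension:\n"
    "- textual_faithfulness: does it say the right thing?\n"
    "- style_adherence: does it match the intended style?\n"
    "- naturalness: does it sound smooth and natural?\n"
    'Answer with JSON only, e.g. {"textual_faithfulness": 4, '
    '"style_adherence": 5, "naturalness": 3}.'
)

def score(instruction: str, wav_path: str) -> dict:
    """Judge one audio response along the three VStyle dimensions."""
    audio = genai.upload_file(wav_path)  # push the wav to the Gemini File API
    reply = judge.generate_content([f"Instruction: {instruction}\n\n{RUBRIC}", audio])
    return json.loads(reply.text)  # assumes the judge returns bare JSON

print(score("Whisper a bedtime story.", "example.wav"))
```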
VStyle goes beyond checking correctness: it evaluates how well the model speaks. Experiments on a range of open-source and commercial systems show that VStyle effectively differentiates models' voice style adaptation abilities.
Evaluation results of different SLMs.
We evaluate three proprietary systems: GPT-4o Audio (snapshot: gpt-4o-audio-preview-2025-06-03), GPT-4o-Mini Audio (snapshot: gpt-4o-mini-audio-preview-2024-12-17), and Doubao. We also include four open-source end-to-end speech language models with strong speech generation performance: Step-Audio, Kimi-Audio, Baichuan-Audio, and Qwen2.5-Omni.
We provide a Gemini API–based evaluation tool for assessing voice synthesis quality across multiple dimensions. It automatically processes audio samples, generates scores, and produces comprehensive analysis reports.
Quick Example:

```bash
# Install dependencies
pip install google-generativeai matplotlib pandas tqdm

# Run evaluation on example data
python lalm_eval/gemini_eval.py \
    --root_dir ./data/examples/model_res/en/wav \
    --metadata_path ./data/examples/model_res/en/metadata.jsonl \
    --out_dir ./data/examples/eval_res/en \
    --gemini_api_key YOUR_API_KEY
```

For detailed usage instructions, see: lalm_eval/README.md.
For inference results of other models reported in our paper, please refer to the dataset at https://huggingface.co/datasets/zhanjun/VStyle-responses.
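These responses can be fetched with the same `huggingface-cli` pattern used for the evaluation results below, for example:

```bash
# Download the released inference results (model responses)
huggingface-cli download --repo-type dataset zhanjun/VStyle-responses --local-dir VStyle-responses
```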
We reproduce the paper's correlation study between human annotations and LALM-as-a-Judge scores, which validates the reliability of the automated evaluation.
Quick Example:

```bash
# Download evaluation results of all seven models
huggingface-cli download --repo-type dataset --local-dir-use-symlinks False zhanjun/VStyle-eval-results --local-dir VStyle-eval-results

# Compute Spearman correlations
python human_align/compute_model_human_spearman_r.py
```

For detailed analysis instructions, see: human_align/README.md.
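
Under the hood, Spearman's r is just a rank correlation over paired scores for the same samples. A minimal self-contained sketch with scipy (the score lists here are made-up placeholders, not data from the paper):

```python
from scipy.stats import spearmanr

# Paired scores for the same samples: placeholders, not real annotations.
human_scores = [4, 2, 5, 3, 1, 4]
judge_scores = [5, 2, 4, 3, 2, 4]

r, p = spearmanr(human_scores, judge_scores)
print(f"Spearman r = {r:.3f} (p = {p:.3g})")
```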
To submit your evaluation results to VStyle, please send the results file (metadata_with_score.jsonl) to jzhan24@m.fudan.edu.cn.
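
Before submitting, you may want to sanity-check the file. `metadata_with_score.jsonl` is line-delimited JSON; the sketch below averages each judge dimension, assuming the score keys match the three dimensions above (check the file produced by `gemini_eval.py` for the actual field names):

```python
import json

# Average each judge dimension over the results file (field names are assumed).
totals, n = {}, 0
with open("metadata_with_score.jsonl", encoding="utf-8") as f:
    for line in f:
        rec = json.loads(line)
        for key in ("textual_faithfulness", "style_adherence", "naturalness"):
            totals[key] = totals.get(key, 0) + rec[key]
        n += 1

for key, total in totals.items():
    print(f"{key}: {total / n:.2f}")
```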
This project is licensed under the MIT License.


