
GiantAILab/DiaMoE-TTS


DiaMoE-TTS: A Unified IPA-based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation



Overview ✨

This repository provides a comprehensive implementation of our series of research results on unified dialect TTS. Specifically, it includes:

  • 🧠 A modular multi-dialect TTS framework built on F5-TTS.
  • 🔤 A unified IPA-based dialect frontend for consistent cross-dialect phonetic representation.
  • 🏋️ Training & inference scripts (CLI + config examples) for end-to-end reproduction.
  • 🤗 Hugging Face checkpoints for easy access to pre-trained models.

A short introduction to DiaMoE-TTS:

Dialect speech embodies rich cultural and linguistic diversity, yet building text-to-speech (TTS) systems for dialects remains challenging due to scarce data, inconsistent orthographies, and complex phonetic variation. To address these issues, we present DiaMoE-TTS, a unified IPA-based framework that standardizes phonetic representations and resolves grapheme-to-phoneme ambiguities. Built upon the F5-TTS architecture, the system introduces a dialect-aware Mixture-of-Experts (MoE) to model phonological differences and employs parameter-efficient adaptation with Low-Rank Adaptation (LoRA) and Conditioning Adapters for rapid transfer to new dialects. Unlike approaches that depend on large-scale or proprietary resources, DiaMoE-TTS enables scalable synthesis driven by open data. Experiments demonstrate natural and expressive speech generation, including zero-shot performance on unseen dialects and on specialized domains such as Peking Opera, using only a few hours of data.
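To make the two mechanisms above more concrete, here is a minimal pure-Python sketch. It is not taken from this repository. It shows two things:

- a soft Mixture-of-Experts layer whose gate is conditioned on a dialect embedding;
- a LoRA-wrapped linear layer.

Several choices are illustrative assumptions:

- the function names (`dialect_moe`, `lora_linear`);
- the use of plain linear experts (the released `MLPexpert_base_model` presumably uses MLP experts);
- concatenating the dialect embedding into the gate input.

Conditioning Adapters are not shown.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(logits: Vector) -> Vector:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dialect_moe(x: Vector, dialect_emb: Vector, gate_w: Matrix,
                experts: List[Matrix]) -> Vector:
    """Soft mixture of experts; the gate sees the hidden state AND a dialect embedding.

    Assumption: experts are single linear maps here to keep the sketch short.
    """
    gate_in = x + dialect_emb  # list concatenation: [hidden features, dialect features]
    weights = softmax(matvec(gate_w, gate_in))  # one mixing weight per expert
    outputs = [matvec(e, x) for e in experts]
    return [sum(w * out[i] for w, out in zip(weights, outputs))
            for i in range(len(outputs[0]))]


def lora_linear(x: Vector, w_frozen: Matrix, a: Matrix, b: Matrix,
                alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A (r x d_in) and B (d_out x r) train."""
    r = len(a)  # LoRA rank = number of rows in A
    base = matvec(w_frozen, x)
    delta = matvec(b, matvec(a, x))
    return [y + (alpha / r) * d for y, d in zip(base, delta)]


if __name__ == "__main__":
    experts = [[[1.0, 0.0], [0.0, 1.0]],   # identity expert
               [[2.0, 0.0], [0.0, 2.0]]]   # doubling expert
    zero_gate = [[0.0] * 4, [0.0] * 4]     # all-zero gate -> equal expert weights
    print(dialect_moe([1.0, 2.0], [1.0, 0.0], zero_gate, experts))  # → [1.5, 3.0]
    print(lora_linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]],
                      [[1.0, 1.0]], [[1.0], [0.0]], alpha=2.0))     # → [7.0, 2.0]
```

With an all-zero gate, both experts receive weight 0.5. A trained, dialect-aware gate shifts weight toward the experts suited to the input dialect.

In standard LoRA practice, B is zero-initialized, so the low-rank branch is a no-op when adaptation starts. Only the small A and B matrices then change when transferring to a new dialect.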

[Figure: DiaMoE-TTS backbone]

The International Phonetic Alphabet (IPA) is the most widely used phonetic annotation system in research on Chinese dialects: the vast majority of Chinese dialect corpora, including homophone tables, dictionaries and texts, use the IPA for phonetic transcription. The phonetic annotation system for this project is therefore based on the IPA. From a base of 100+ IPA phoneme symbols, it constructs a highly scalable phoneme inventory, currently containing 442 units. The system is designed to support the phonetic annotation of all known Chinese dialects and is also extensible to European languages. It currently supports 11 dialects and Mandarin, and its validity has also been verified for English, French, German and the Bildts dialect of Dutch.
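The sketch below illustrates how an IPA transcription can be split into units from a fixed phoneme inventory, using greedy longest-match segmentation. This is an illustrative assumption, not necessarily the algorithm used by the frontend. `segment_ipa` and the tiny inventory (with Chao tone values written as digits) are hypothetical and far smaller than the real 442-unit inventory.

```python
from typing import List, Set


def segment_ipa(text: str, inventory: Set[str]) -> List[str]:
    """Greedy longest-match segmentation of an IPA string into inventory units.

    Hypothetical helper: multi-character units (e.g. aspirated affricates,
    diphthongs, tone values) take priority over their shorter prefixes.
    """
    max_len = max(len(u) for u in inventory)
    units: List[str] = []
    i = 0
    while i < len(text):
        if text[i].isspace():  # whitespace separates syllables; it is not a unit
            i += 1
            continue
        # try the longest possible candidate first, then shrink
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in inventory:
                units.append(candidate)
                i += size
                break
        else:  # no break: nothing in the inventory matches at position i
            raise ValueError(f"no inventory unit matches {text[i:]!r}")
    return units


if __name__ == "__main__":
    # Hypothetical toy inventory; tone values "55" and "21" are single units.
    inventory = {"ts", "tsʰ", "t", "s", "ʰ", "a", "i", "ai", "55", "21"}
    print(segment_ipa("tsʰai55 ta21", inventory))
    # → ['tsʰ', 'ai', '55', 't', 'a', '21']
```

Longest-match-first keeps `tsʰ` as one aspirated affricate instead of splitting it into `t`, `s` and `ʰ`. This is why a compact set of base symbols can still yield a larger, consistent unit inventory.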

Regarding the construction details of the IPA dialect frontend system, please refer to:


News & Updates 🗞️

  • 🚀[2025-09-21] Initial public release of the codebase.
  • 🔥[2025-09-25] Released checkpoints on 🤗 Hugging Face.
  • 📦[2025-09-25] Released training datasets.
  • 📄[2025-10-05] Updated our paper on arXiv.
  • 🧠[2025-10-08] Released a Gradio app for quick start!

Installation 🛠️

# clone code
git clone https://github.com/GiantAILab/DiaMoE-TTS.git
cd DiaMoE-TTS

# conda environment
conda create -n diamoetts python=3.10
conda activate diamoetts
cd diamoe_tts
pip install -e .

Quick Start 🚀

Training

cd diamoe_tts
accelerate launch --config_file default_config.yaml \
  src/f5_tts/train/train.py \
  --config-name diamoetts.yaml

Inference

bash ./src/f5_tts/infer/batch_infer.sh

See diamoe_tts for more details.

IPA Frontend

cd dialect_frontend
bash single_frontend.sh 1-6 <dialect_name> <input_file.txt>

See ipa_frontend for more details.


Datasets 📚

For training, we use the Common Voice Cantonese dataset, the Emilia Mandarin dataset, dialectal data
from the KeSpeech corpus and an open-source Southern Min dataset.
We release the IPA frontend output for the open-source datasets: 🤗 open-source dataset IPA, 🔮 open-source dataset IPA.


Pretrained Models 🧪

Model 🤗 Hugging Face 👷 Status
🚀 MLPexpert_base_model HF
🚀 yunbai(Peking Opera)_lora HF
🚀 jingbai(Peking Opera)_lora HF
🚀 nanjing_lora HF
🛠️ our g2pw HF

Development Roadmap & TODO 🗺️

  • release code for train/infer
  • release code for IPA frontend
  • release our checkpoints
  • release open-source training dataset IPA frontend
  • develop a Gradio app for DiaMoE-TTS

Acknowledgements 🙏

  • Thanks to all contributors and community members who helped improve this project.
  • This work builds upon F5-TTS and related research.
  • Our frontend first uses PaddleSpeech to obtain Mandarin pinyin.

License 📝

Our code is released under the MIT License.


📚 Citation

If you find our model helpful, please consider citing our project 📝 and starring us ⭐️!

@article{chen2025diamoe,
  title={DiaMoE-TTS: A Unified IPA-Based Dialect TTS Framework with Mixture-of-Experts and Parameter-Efficient Zero-Shot Adaptation},
  author={Chen, Ziqi and Chen, Gongyu and Wang, Yihua and Ding, Chaofan and Zhang, Wei-Qiang and others},
  journal={arXiv preprint arXiv:2509.22727},
  year={2025}
}
