index-tts-lora

This project builds on Bilibili's index-tts and provides LoRA fine-tuning for both single-speaker and multi-speaker setups. The goal is to improve the prosody and naturalness of synthesized audio for high-quality speakers.

Training & Inference

1. Audio token and speaker condition extraction

# Extract tokens and speaker conditions
python tools/extract_codec.py --audio_list ${audio_list} --extract_condition

# audio_list format: one clip per line, audio path and transcript separated by a tab (\t)
/path/to/audio.wav	小朋友们，大家好，我是凯叔，今天我们讲一个龟兔赛跑的故事。
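The list file can be generated with a few lines of Python. The following is a minimal sketch, assuming UTF-8 encoding and one clip per line. The helper names `write_audio_list` and `read_audio_list` are illustrative and are not part of this repository.

```python
from pathlib import Path


def write_audio_list(pairs, out_path):
    """Write (audio_path, transcript) pairs as tab-separated lines."""
    lines = []
    for audio_path, transcript in pairs:
        # A tab in the path would break the two-column layout, so reject it.
        if "\t" in str(audio_path):
            raise ValueError(f"tab character in path: {audio_path!r}")
        # Collapse tabs, newlines, and repeated spaces inside the transcript.
        text = " ".join(transcript.split())
        lines.append(f"{audio_path}\t{text}")
    Path(out_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def read_audio_list(path):
    """Parse the list file back into (audio_path, transcript) tuples."""
    pairs = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # skip blank lines
        audio_path, transcript = line.split("\t", 1)
        pairs.append((audio_path, transcript))
    return pairs
```

Reading the file back with `read_audio_list` before running extraction is a cheap way to catch lines where the tab separator was lost.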

After extraction, the processed files and speaker_info.json are written to the finetune_data/processed_data/ directory. An example speaker_info.json:

[
    {
        "speaker": "kaishu_30min",
        "avg_duration": 6.6729,
        "sample_num": 270,
        "total_duration_in_seconds": 1801.696,
        "total_duration_in_minutes": 30.028,
        "total_duration_in_hours": 0.500,
        "train_jsonl": "/path/to/kaishu_30min/metadata_train.jsonl",
        "valid_jsonl": "/path/to/kaishu_30min/metadata_valid.jsonl",
        "medoid_condition": "/path/to/kaishu_30min/medoid_condition.npy"
    }
]
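The duration fields in an entry are related by simple arithmetic: the minutes and hours values are the total in seconds divided by 60 and 3600, and `avg_duration` is the total divided by `sample_num`. The sketch below recomputes those fields and checks an entry for consistency. The function names and the rounding precision are assumptions inferred from the example above, not the repository's actual code.

```python
def summarize_speaker(speaker, durations_sec):
    """Build duration statistics shaped like a speaker_info.json entry."""
    if not durations_sec:
        raise ValueError("no clips for speaker")
    total = sum(durations_sec)
    return {
        "speaker": speaker,
        "avg_duration": round(total / len(durations_sec), 4),
        "sample_num": len(durations_sec),
        "total_duration_in_seconds": round(total, 3),
        "total_duration_in_minutes": round(total / 60, 3),
        "total_duration_in_hours": round(total / 3600, 3),
    }


def check_entry(entry, tol=1e-3):
    """Return True if the derived duration fields agree with the total in seconds."""
    total = entry["total_duration_in_seconds"]
    return (
        abs(entry["total_duration_in_minutes"] - total / 60) <= tol
        and abs(entry["total_duration_in_hours"] - total / 3600) <= tol
        and abs(entry["avg_duration"] - total / entry["sample_num"]) <= tol
    )


# Duration fields copied from the example entry above.
EXAMPLE_ENTRY = {
    "speaker": "kaishu_30min",
    "avg_duration": 6.6729,
    "sample_num": 270,
    "total_duration_in_seconds": 1801.696,
    "total_duration_in_minutes": 30.028,
    "total_duration_in_hours": 0.500,
}
```

Running `check_entry` on each entry after extraction is a quick sanity check that no clips were dropped or double-counted.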

2. Training

python train.py

3. Inference

python indextts/infer.py

Fine-tuning Results

This experiment uses Chinese audio data from Kai Shu Tells Stories, with a total duration of ~30 minutes and 270 audio clips. The dataset is split into 244 training samples and 26 validation samples. Note: Transcripts were generated automatically via ASR and punctuation models, without manual correction, so some errors are expected.
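A split of this shape can be reproduced with a seeded shuffle and a fixed validation count. This is only a sketch: how the repository actually selects validation samples is not described here, and the function name, seed, and placeholder clip names are assumptions.

```python
import random


def split_train_valid(samples, n_valid, seed=0):
    """Shuffle deterministically and hold out n_valid samples for validation."""
    if not 0 < n_valid < len(samples):
        raise ValueError("n_valid must be between 1 and len(samples) - 1")
    shuffled = list(samples)  # copy so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_valid:], shuffled[:n_valid]


# Placeholder clip names standing in for the 270 real clips.
clips = [f"clip_{i:03d}.wav" for i in range(270)]
train, valid = split_train_valid(clips, n_valid=26)
print(len(train), len(valid))  # 244 26
```

Fixing the seed keeps the validation set stable across runs, so validation loss stays comparable between experiments.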

Example training sample: 他上了马车，来到了皇宫之中。 (kaishu_train_01.wav)

1. Speech Synthesis Examples

Text	Audio
老宅的钟表停在午夜三点，灰尘中浮现一串陌生脚印。侦探蹲下身，发现地板缝隙里藏着一枚带血的戒指。	kaishu_cn_1.wav
月光下，南瓜突然长出笑脸，藤蔓扭动着推开花园栅栏。小女孩踮起脚，听见蘑菇在哼唱古老的摇篮曲。	kaishu_cn_2.wav
那么Java里面中级还要学，M以及到外部前端的应用系统开发，要学到Java Script的数据库，要学做动态的网站。	kaishu_cn_en_mix_1.wav
这份 financial report 详细分析了公司在过去一个季度的 revenue performance 和 expenditure trends。	kaishu_cn_en_mix_2.wav
上山下山上一山，下一山，跑了三里三米三，登了一座大高山，山高海拔三百三。上了山，大声喊：我比山高三尺三。	kaishu_raokouling.wav
A thin man lies against the side of the street with his shirt and a shoe off and bags nearby.	kaishu_en_1.wav
As research continued, the protective effect of fluoride against dental decay was demonstrated.	kaishu_en_2.wav

2. Model Evaluation

For details of the evaluation set, see "2025 Benchmark of Mainstream TTS Models: Who Is the Best Voice Synthesis Solution?"

Acknowledgements

index-tts

finetune-index-tts
