This project builds on Bilibili's index-tts and provides LoRA fine-tuning recipes for both single-speaker and multi-speaker setups. It aims to improve prosody and naturalness when synthesizing speech for speakers with high-quality audio.
# Extract tokens and speaker conditions
python tools/extract_codec.py --audio_list ${audio_list} --extract_condition
# audio_list format: audio_path + transcript, separated by \t
/path/to/audio.wav 小朋友们,大家好,我是凯叔,今天我们讲一个龟兔赛跑的故事。
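Before running the extraction, it can help to check that every line of the audio list really has the `audio_path<TAB>transcript` shape described above. The following is a minimal sketch, not part of this repository: the helper name `parse_audio_list` is hypothetical, and it only assumes the tab-separated format stated in the comment.

```python
def parse_audio_list(text):
    """Parse lines of '<audio_path>\\t<transcript>' into (path, transcript) pairs.

    Hypothetical helper for illustration; it is not part of tools/extract_codec.py.
    """
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a transcript are preserved.
        parts = line.split("\t", 1)
        if len(parts) != 2 or not parts[0].strip() or not parts[1].strip():
            raise ValueError(
                f"line {line_no}: expected '<audio_path>\\t<transcript>'"
            )
        entries.append((parts[0].strip(), parts[1].strip()))
    return entries


# One line in the documented format: path, a tab, then the transcript.
sample = "/path/to/audio.wav\t小朋友们,大家好,我是凯叔,今天我们讲一个龟兔赛跑的故事。\n"
entries = parse_audio_list(sample)
print(entries[0][0])
```

A line that lacks the tab separator raises `ValueError` with its line number, which makes a malformed list easier to fix before extraction starts.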
After extraction, the processed files and speaker_info.json will be generated under the finetune_data/processed_data/ directory. For example:
[
    {
        "speaker": "kaishu_30min",
        "avg_duration": 6.6729,
        "sample_num": 270,
        "total_duration_in_seconds": 1801.696,
        "total_duration_in_minutes": 30.028,
        "total_duration_in_hours": 0.500,
        "train_jsonl": "/path/to/kaishu_30min/metadata_train.jsonl",
        "valid_jsonl": "/path/to/kaishu_30min/metadata_valid.jsonl",
        "medoid_condition": "/path/to/kaishu_30min/medoid_condition.npy"
    }
]
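The duration fields in this example look mutually consistent, and a short script can confirm that after an extraction run. The sketch below is an illustration only: the helper `check_speaker_entry` is hypothetical, and it assumes (this README does not state it) that the average, minute, and hour fields are derived from `total_duration_in_seconds` and `sample_num` and then rounded.

```python
import json
import math

# The example entry from above, inlined so the sketch runs without any files.
speaker_info_text = """
[
    {
        "speaker": "kaishu_30min",
        "avg_duration": 6.6729,
        "sample_num": 270,
        "total_duration_in_seconds": 1801.696,
        "total_duration_in_minutes": 30.028,
        "total_duration_in_hours": 0.500
    }
]
"""


def check_speaker_entry(entry, rel_tol=1e-2):
    """Compare each derived duration field against the total in seconds.

    Hypothetical helper; the tolerance absorbs rounding in the stored values.
    """
    seconds = entry["total_duration_in_seconds"]
    return {
        "avg_duration": math.isclose(
            seconds / entry["sample_num"], entry["avg_duration"], rel_tol=rel_tol
        ),
        "minutes": math.isclose(
            seconds / 60, entry["total_duration_in_minutes"], rel_tol=rel_tol
        ),
        "hours": math.isclose(
            seconds / 3600, entry["total_duration_in_hours"], rel_tol=rel_tol
        ),
    }


speakers = json.loads(speaker_info_text)
report = {s["speaker"]: check_speaker_entry(s) for s in speakers}
print(report)
```

In practice the text would be read from the generated speaker_info.json rather than inlined. A `False` in the report would suggest the file was edited by hand or produced from a different set of clips.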
python train.py
python indextts/infer.py
This experiment uses Chinese audio data from Kai Shu Tells Stories, with a total duration of ~30 minutes and 270 audio clips. The dataset is split into 244 training samples and 26 validation samples. Note: Transcripts were generated automatically via ASR and punctuation models, without manual correction, so some errors are expected.
Example training sample: 他上了马车,来到了皇宫之中。 (kaishu_train_01.wav)
Text | Audio |
---|---|
老宅的钟表停在午夜三点,灰尘中浮现一串陌生脚印。侦探蹲下身,发现地板缝隙里藏着一枚带血的戒指。 | kaishu_cn_1.wav |
月光下,南瓜突然长出笑脸,藤蔓扭动着推开花园栅栏。小女孩踮起脚,听见蘑菇在哼唱古老的摇篮曲。 | kaishu_cn_2.wav |
那么Java里面中级还要学,M以及到外部前端的应用系统开发,要学到Java Script的数据库,要学做动态的网站。 | kaishu_cn_en_mix_1.wav |
这份 financial report 详细分析了公司在过去一个季度的 revenue performance 和 expenditure trends。 | kaishu_cn_en_mix_2.wav |
上山下山上一山,下一山,跑了三里三米三,登了一座大高山,山高海拔三百三。上了山,大声喊:我比山高三尺三。 | kaishu_raokouling.wav |
A thin man lies against the side of the street with his shirt and a shoe off and bags nearby. | kaishu_en_1.wav |
As research continued, the protective effect of fluoride against dental decay was demonstrated. | kaishu_en_2.wav |
For details of the evaluation set, see: 2025 Benchmark of Mainstream TTS Models: Who Is the Best Voice Synthesis Solution?