Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis
Lina-Speech is a neural codec language model that achieves state-of-the-art performance on zero-shot TTS. It replaces self-attention with Gated Linear Attention (GLA), which we believe is a sound choice for audio (see the sketch after the list below). It features:
- Voice cloning from short samples by prompt continuation.
- High throughput: large-batch inference comes at little extra cost on a consumer-grade GPU.
- Initial-state tuning (shout-out to RWKV, with a fast implementation by FLA): fast speaker adaptation by tuning only the initial recurrent state, keeping your context window free of long prompts.
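For intuition, here is a minimal, unfused sketch of the GLA recurrence and of what initial-state tuning optimizes. It is illustrative only: the shapes, the per-channel gating, and the toy loss are assumptions, and the real model uses the chunked kernels from flash-linear-attention rather than a Python loop.

```python
import torch

def gla_recurrence(q, k, v, g, S0=None):
    """Unfused gated linear attention recurrence over one sequence.

    q, k, g: (T, d_k) tensors, with gates g in (0, 1); v: (T, d_v).
    The state S is a (d_k, d_v) matrix updated as
        S_t = diag(g_t) @ S_{t-1} + outer(k_t, v_t),   o_t = q_t @ S_t.
    S0 is the initial state: the only quantity trained in initial-state tuning.
    """
    T, d_k = q.shape
    d_v = v.shape[-1]
    S = torch.zeros(d_k, d_v, dtype=q.dtype) if S0 is None else S0
    outs = []
    for t in range(T):
        # Per-channel decay of the state, then write the new key/value pair.
        S = g[t].unsqueeze(-1) * S + torch.outer(k[t], v[t])
        outs.append(q[t] @ S)  # read out with the current query
    return torch.stack(outs), S

# Initial-state tuning in a nutshell: freeze every model weight and
# optimize only S0 on a few samples of the target speaker.
d_k, d_v, T = 64, 64, 128
S0 = torch.zeros(d_k, d_v, requires_grad=True)
opt = torch.optim.Adam([S0], lr=1e-2)
q, k, v = (torch.randn(T, d) for d in (d_k, d_k, d_v))
g = torch.sigmoid(torch.randn(T, d_k))
out, _ = gla_recurrence(q, k, v, g, S0)
out.pow(2).mean().backward()  # stand-in loss; gradients flow only into S0
opt.step()
```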
```bash
conda create -n lina python=3.10
conda activate lina
pip install torch==2.5.1
pip install causal-conv1d==1.3.0.post1
pip install -r requirements.txt
ln -s 3rdparty/flash-linear-attention/fla fla
ln -s 3rdparty/encoder encoder
ln -s 3rdparty/decoder decoder
# Pin flash-linear-attention to the tested commit.
cd 3rdparty/flash-linear-attention
git checkout 739ef15f97cff06366c97dfdf346f2ceaadf05ce
```
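A quick way to verify the environment, assuming you run Python from the repository root where the symlinks were created:

```python
# Run from the repository root so the `fla` symlink is importable.
import torch
import fla  # flash-linear-attention, via the symlink created above

print(torch.__version__, torch.cuda.is_available())
```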
You will need this WavTokenizer checkpoint and its config file: [WavTokenizer-ckpt] [config file]
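A loading and round-trip sketch, following the usage shown in the WavTokenizer repository (reachable here through the `decoder` symlink); the paths and the input file are placeholders:

```python
import torch
import torchaudio
from decoder.pretrained import WavTokenizer  # via the 3rdparty/decoder symlink

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholders: point these at the downloaded config and checkpoint.
wavtokenizer = WavTokenizer.from_pretrained0802("path/to/config.yaml", "path/to/wavtokenizer.ckpt")
wavtokenizer = wavtokenizer.to(device)

# Encode a 24 kHz mono waveform into discrete codes, then decode it back.
wav, sr = torchaudio.load("sample.wav")  # placeholder input file
wav = torchaudio.functional.resample(wav, sr, 24000).to(device)
bandwidth_id = torch.tensor([0], device=device)
features, codes = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
audio = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)
```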
Dataset: LibriTTS + LibriTTS-R + an English split of MLS (10k hours) + GigaSpeech XL.
A 169M-parameter version trained for 100B tokens: [Lina-Speech 169M]
See InferenceLina.ipynb and complete the first cells with the correct checkpoint and config paths.
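Purely for orientation, zero-shot cloning by prompt continuation has the following shape; `encode_text`, `model.generate`, and the token layout below are hypothetical stand-ins, not the actual notebook API:

```python
# Hypothetical sketch -- `encode_text` and `model.generate` are illustrative
# stand-ins for what InferenceLina.ipynb actually does.
prompt_codes = codes  # WavTokenizer codes of a short reference sample (see above)
text_tokens = encode_text("Hello from a cloned voice.")

# The LM continues the reference audio tokens conditioned on the text,
# so the generated continuation keeps the reference speaker's voice.
generated_codes = model.generate(text=text_tokens, audio_prompt=prompt_codes)

features = wavtokenizer.codes_to_features(generated_codes)
audio = wavtokenizer.decode(features, bandwidth_id=torch.tensor([0]))
```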
Demo videos: `horse.mp4`, `count_it_up.mp4`
- The RWKV authors and the community for carrying out high-level, truly open-source research.
- @SmerkyG for making it easy to test cutting-edge language models.
- The GLA/flash-linear-attention authors for their outstanding work.
- The WavTokenizer authors for releasing such a brilliant speech codec.
- 🤗 for supporting this project.
```bibtex
@misc{lemerle2024linaspeechgatedlinearattention,
      title={Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis},
      author={Théodor Lemerle and Harrison Vanderbyl and Vaibhav Srivastav and Nicolas Obin and Axel Roebel},
      year={2024},
      eprint={2410.23320},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2410.23320},
}
```
Before using these pre-trained models, you agree to only clone voices whose speakers have granted permission, either directly or by license. If you do not have such permission, you must publicly announce that the voices are synthesized before making them public, i.e., inform listeners that the speech samples were produced by the pre-trained models.
This work was initiated in the Analysis/Synthesis team of the STMS Laboratory at IRCAM and was funded by the following project: