- 🚀Model Architecture
- 🤔What is UTAUTAI?
- 🐍Method
- 🧠TODO
- 🙏Appreciation
- ⭐️Show Your Support
- 🙆Welcome Contributions
An open-source repository aimed at generating matching vocal and instrumental tracks from lyrics, similar to Suno AI's Chirp and Riffusion.
UTAUTAI's method is mainly inspired by SPEAR TTS.
During training, the input consists of semantic tokens produced by the 'lyrics2semantic AR' model, which extracts semantic tokens from lyrics, together with acoustic tokens. In addition, MERT representations extracted from the music are quantized with k-means to obtain further semantic tokens.
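To make the MERT-to-semantic-token step concrete, here is a minimal sketch of extracting frame-level MERT features and quantizing them with k-means; the cluster index of each frame becomes its semantic token. The checkpoint (`m-a-p/MERT-v1-95M`), the hidden-layer index, and the codebook size of 1024 are illustrative assumptions, not necessarily the values used in this repository.

```python
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import AutoModel, Wav2Vec2FeatureExtractor

# Checkpoint choice is an assumption for illustration.
MERT_ID = "m-a-p/MERT-v1-95M"
mert = AutoModel.from_pretrained(MERT_ID, trust_remote_code=True).eval()
processor = Wav2Vec2FeatureExtractor.from_pretrained(MERT_ID, trust_remote_code=True)

def mert_features(wav_path: str, layer: int = 7) -> torch.Tensor:
    """Return frame-level MERT representations of shape (T, D) for one audio file."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, processor.sampling_rate)
    inputs = processor(wav.numpy(), sampling_rate=processor.sampling_rate, return_tensors="pt")
    with torch.no_grad():
        out = mert(**inputs, output_hidden_states=True)
    # Which hidden layer works best is an open choice.
    return out.hidden_states[layer].squeeze(0)

def fit_kmeans(frame_batches, n_clusters: int = 1024) -> MiniBatchKMeans:
    """Fit k-means over a corpus of (T, D) frame tensors."""
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096)
    for frames in frame_batches:
        km.partial_fit(frames.numpy())
    return km

def to_semantic_tokens(km: MiniBatchKMeans, frames: torch.Tensor) -> torch.Tensor:
    """Map each MERT frame to its cluster id, i.e. its semantic token."""
    return torch.from_numpy(km.predict(frames.numpy()))  # shape (T,)
```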
However, MERT representations cannot be obtained from the music during inference. We therefore train a Style Module, following the methodology of PromptTTS2, to predict the target MERT representations from the prompt at inference time. The Style Module is a transformer-based diffusion model.
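Below is a minimal sketch of what a transformer-based diffusion Style Module could look like: a small transformer denoiser that predicts the noise added to MERT frames, conditioned on a prompt embedding and a diffusion timestep. The dimensions, layer counts, and the simplified DDPM-style noise schedule are assumptions for illustration, not this repository's actual implementation.

```python
import torch
import torch.nn as nn

class StyleModule(nn.Module):
    """Toy transformer denoiser: predicts the noise added to MERT frames,
    conditioned on prompt tokens and a diffusion timestep (illustrative only)."""
    def __init__(self, mert_dim=768, prompt_dim=768, d_model=512, n_layers=6, n_steps=1000):
        super().__init__()
        self.in_proj = nn.Linear(mert_dim, d_model)
        self.prompt_proj = nn.Linear(prompt_dim, d_model)
        self.t_embed = nn.Embedding(n_steps, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, mert_dim)

    def forward(self, noisy_mert, prompt, t):
        # noisy_mert: (B, T, mert_dim), prompt: (B, P, prompt_dim), t: (B,)
        x = self.in_proj(noisy_mert) + self.t_embed(t).unsqueeze(1)
        x = torch.cat([self.prompt_proj(prompt), x], dim=1)  # prepend prompt tokens
        x = self.backbone(x)
        return self.out_proj(x[:, prompt.size(1):])          # predicted noise for the MERT frames

def training_step(model, mert, prompt, betas):
    """One DDPM-style epsilon-prediction step with a simplified noise schedule."""
    t = torch.randint(0, betas.numel(), (mert.size(0),))
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(-1, 1, 1)
    noise = torch.randn_like(mert)
    noisy = alpha_bar.sqrt() * mert + (1 - alpha_bar).sqrt() * noise
    return nn.functional.mse_loss(model(noisy, prompt, t), noise)
```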
I think this approach can successfully accomplish the target tasks. What do you think?
- How can we obtain lyrics that match the cropped audio? Or should we even crop the audio in the first place? code
- Examine the handling of phonemization and special tokens, and make necessary code modifications. code
- Correct the collator in the dataset. code
- Complete the StyleModule inference code. code
- Other minor code fixes, such as masking strategies.
- Eliminate the diffusion model and adapt the consistency model.
- SPEAR TTS paper
- VALL-E paper
- JukeBox paper
- SoundStorm paper
- MusicLM paper
- AudioLM paper
- MusicGen paper
- PromptTTS2 paper
- lucidrains' Soundstorm repo
- soundstorm speechtokenizer
- lifeiteng's vall-e
- Plachtaa's VALL-E-X
- bark
If you find UTAUTAI interesting and useful, give us a star on GitHub! ⭐️ It encourages us to keep improving the model and adding exciting features.
Contributions are always welcome.