This project explores and implements various text-to-speech (TTS) alignment techniques, aiming to improve the quality and efficiency of TTS systems. Our work spans multiple approaches, each addressing different aspects of the alignment challenge.
This repository is organized into three main branches, each representing a distinct approach to TTS alignment:
- MoBoAligner
  - Status: Completed, for reference only
  - Description: Unofficial implementation of the paper "MoBoAligner: a Neural Alignment Model for Non-autoregressive TTS with Monotonic Boundary Search"
  - Purpose: Learning and baseline comparison
  - Limitation: Not suitable for large-scale applications due to maximum duration constraints
- RoMoAligner
  - Status: Development halted, for reference only
  - Description: Experimental attempt to improve on MoBoAligner by combining it with Rough Alignment
  - Purpose: Explore self-supervised learning techniques in TTS alignment
  - Limitation: Performance improvements were limited and did not meet expectations
- OTA 👈 Current Focus
  - Status: In active planning and early development
  - Description: Adaptation of the "One TTS Alignment To Rule Them All" (OTA) method for implicit pause modeling
  - Goal: Develop a solution for handling implicit pauses without relying on explicit silence tokens
  - Progress: Conceptual development and planning phase
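
For readers new to the topic, the sketch below illustrates what a monotonic hard alignment is. It is not code from this repository. The OTA paper extracts hard alignments with a Viterbi-style monotonic search over token-to-frame scores; here the function name `monotonic_viterbi`, the toy score matrix, and the use of NumPy are all assumptions made for illustration. Note that in this sketch every frame must be assigned to some token, so a pause with no silence token ends up inside a neighbouring token's duration.

```python
import numpy as np


def monotonic_viterbi(log_probs):
    """Return the number of frames assigned to each token (hypothetical helper).

    log_probs has shape (num_tokens, num_frames); entry [i, t] scores aligning
    token i with frame t. The path starts at the first token and first frame,
    ends at the last token and last frame, and at every new frame either stays
    on the same token or advances to the next one.
    """
    log_probs = np.asarray(log_probs, dtype=float)
    n_tokens, n_frames = log_probs.shape
    if n_tokens > n_frames:
        raise ValueError("need at least one frame per token")

    score = np.full((n_tokens, n_frames), -np.inf)
    advanced = np.zeros((n_tokens, n_frames), dtype=bool)
    score[0, 0] = log_probs[0, 0]

    # Forward pass: best score of any monotonic path ending at (token i, frame t).
    for t in range(1, n_frames):
        for i in range(min(t + 1, n_tokens)):
            stay = score[i, t - 1]
            move = score[i - 1, t - 1] if i > 0 else -np.inf
            advanced[i, t] = move > stay
            score[i, t] = max(stay, move) + log_probs[i, t]

    # Backward pass: walk back from the last token and frame, counting frames.
    durations = np.zeros(n_tokens, dtype=int)
    i = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        durations[i] += 1
        if advanced[i, t]:
            i -= 1
    return durations


# Toy example: 3 tokens, 6 frames, with a clear block-diagonal structure.
probs = np.full((3, 6), 0.05)
probs[0, 0:2] = 0.9
probs[1, 2:5] = 0.9
probs[2, 5:6] = 0.9
print(monotonic_viterbi(np.log(probs)).tolist())  # [2, 3, 1]
```

The returned durations always sum to the number of frames, which is what a duration predictor in a non-autoregressive TTS model would be trained on.
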
Our primary focus is on the OTA branch, where we're exploring ways to adapt the OTA method for improved alignment, especially in handling implicit pauses in speech.
- Check out each branch for specific implementation details and progress.
- Refer to individual branch READMEs for setup and usage instructions.
- For the latest developments, focus on the OTA branch.
We welcome contributions to any of our branches. If you're interested in contributing:
- Check the issue tracker for tasks related to the branch you want to help with.
- Fork the repository and create a pull request with your improvements.
- For major changes, please open an issue first to discuss what you would like to change.
- Implement MoBoAligner (unofficial implementation)
- Develop and test RoMoAligner
- Adapt and implement OTA for implicit pause modeling
- Conduct comparative studies across all methods
- Refine and optimize the most promising approach
- Original MoBoAligner paper
- OTA paper
We appreciate the support and interest from the TTS and speech processing community in advancing this research.