Follow the main developer's stream for live R&D:
| Status | License | Contribution |
|---|---|---|
| In-Development | LGPL-3.0 | Welcomes Contributions |
The ALIZARIN Engine is currently in Phase 1: Foundation & Data Generation. This project is being developed live on stream as a public-facing R&D process.
The engine's first voice, its mascot and proof of concept, is being developed under the name ALIZARIN, sharing its name with the engine.
The ALIZARIN Engine is a high-level, LGPLv3-licensed voice synthesis framework designed for generating multiple voice products (real-time TTS, AI singing, and concatenative singing) from a single, 100% synthetic source. This initiative is inspired by the pioneering work of missile_39's Adachi Rei, but takes a different approach to the "no-human-inside" concept: a hybrid method that combines modern latent-space AI with algorithmic formant synthesis reminiscent of classic systems such as the Votrax SC-01 and the Tandy TRS-80 Voice Synthesizer built around it.
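To make the "algorithmic formant synthesis" half of that hybrid concrete, here is a minimal sketch of classic parallel-formant synthesis in Pyo. The source/filter layout and the textbook /a/ vowel formant frequencies are illustrative assumptions; the actual ALIZARIN formant scripts are the project's own and are not shown here.

```python
# Minimal parallel-formant synthesis sketch in Pyo (illustrative only).
# A band-limited impulse train stands in for a glottal "buzz" source,
# and three parallel band-pass filters carve out the vowel formants.
from pyo import Server, Blit, Biquad

s = Server().boot()

source = Blit(freq=110, harms=40, mul=0.3)   # buzzy glottal-like source

# Textbook formant frequencies for an /a/ vowel (type=2 is band-pass).
f1 = Biquad(source, freq=730, q=5, type=2)
f2 = Biquad(source, freq=1090, q=5, type=2)
f3 = Biquad(source, freq=2440, q=5, type=2)

vowel = (f1 + f2 + f3).out()                 # sum the bands to the output
s.gui(locals())                              # keep the server running
```

Sweeping each filter's `freq` between vowel targets over time is what turns this static buzz into speech-like output.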
The name is an acronym for the core components of its design:
- Augmented (The layered texture system)
- Latent (The core AI representation)
- Intonation (The quality of the speech output)
- Synthesizer (The engine's primary function)
- Algorithm (The base Pyo formant system)
- Recursive (The neural network's processing method)
- Inference (The process of generating audio)
- Network (The foundational neural architecture)
The ALIZARIN Engine is an open-source, community-focused project with a specific set of goals:
- Unique, Non-Human Voices: The engine's core philosophy is "no human inside." It is designed to create voices from the ground up using algorithmic and synthetic sources, allowing for truly unique, non-human character identities like missile_39's Adachi Rei voice. I aim not to replace or copy their work, but to pay tribute to it by adding new, unique-sounding voice models and styles to the archetype their models established.
- Ethical Commercial Voices: This framework provides a high-quality, free, and open-source path to commercial voices without the need to record, clone, or "steal" an existing human voice.
- Protect Voice Actors: The synthetic nature of the voice makes it identifiable as non-human, reducing the chance that it is used for fraud or to replace an existing voice actor. By offering a creative alternative to unauthorized voice clones, the engine promotes ethical behavior that I hope many creators will adopt.
Goal: To create an "all-encompassing" voice solution, pairing modern TTS with expressive singers (DiffSinger, UTAU) to allow creators to build complete character identities with consistent voices across delivery methods.
This project is built on an ideology of community contribution and creator freedom. We use the LGPLv3 License to achieve a "best-of-both-worlds" scenario:
- A Free, Shared Community Resource: The ALIZARIN Engine framework is (and always will be) open-source. Any modifications or improvements made to the engine framework itself must be shared back with the community.
- Allowance for Unique, Private Voices: The LGPLv3 license draws a "bright line" between "The Library" (our engine) and "A Work that Uses The Library" (your voice). Your "secret sauce" (formant scripts, texture kits, training data) remains 100% your own property and does not need to be shared, allowing unique, privately licensed voices to be built on a free and growing community ecosystem.
The following points are not requirements or conditions of the LGPLv3 license; in fact, the license does not allow us to impose them as binding restrictions. They are our sincere requests to you, the user, to foster a positive and ethical community environment.
- A Request from the Creator (Sharing): We would love for you to share your public creations with the community and the ALIZARIN Engine team. If your voice is posted publicly, please consider sending us a link and using #ALIZARIN on your social media posts to tag us!
- A Request from the Creator (Ethical Usage): We ask all creators using this framework to voluntarily adopt the following minimal restrictions for any voice derived from the ALIZARIN Engine:
- No Hateful Content: Do not use the voice to create or distribute hate speech, harassment, severe threats, or content promoting violence or illegal acts.
- No Impersonation or Fraud: Do not use the voice to impersonate identifiable individuals, especially for the purpose of financial fraud, manipulation, or unauthorized commercial exploitation.
- Transparent Disclosure: Disclose clearly that the voice used in any public-facing content is synthetic and was created by a computational process.
This section summarizes the project's core documentation. For full details, please refer to the dedicated pages:
- VOICE EULA & GUIDELINES (USAGE_GUIDE.md): (READ FIRST) The full EULA, IP rules for the primary ALIZARIN voice, and the special usage request form.
- MULTILINGUAL_RVC_HYBRID_GUIDE.md: The official guide for creating voicebanks in other languages. This is the only potentially permitted exception to our "no-AI-training" rule.
- CREATOR'S GUIDE: Detailed suggestions, Q&A, and best practices for creating and protecting your own voice and character IP.
- THIRD-PARTY LICENSES: A full list of the open-source software that powers this engine and their licenses.
This section details the rules governing the use of the audio files and voice data for the primary ALIZARIN voice, alongside the minimum suggested guidelines we ask users to adopt for all community-created voices.
| Rule Category | ALIZARIN Voice Requirements | ALIZARIN Minimum Suggested Guidelines |
|---|---|---|
| Commercial Music Use | Allowed (Monetization permitted) | Allowed (Monetization permitted) |
| Hate Speech/Slander | Strictly Prohibited and enforced. | Strictly Prohibited. |
| Impersonation/Fraud | Strictly Prohibited and enforced. | Strictly Prohibited. |
| R-18/Adult Content | Prohibited (unless explicitly permitted, see usage request form). | (Not included in minimum suggested guidelines) |
| Religious/Political Usage | Prohibited (see documentation for details). | (Not included in minimum suggested guidelines) |
| Voice Redistribution | Prohibited (Requires EULA agreement). | Prohibited. |
| Voice Alterations/Modifications | Prohibited, with Exceptions. (Prohibited modifications are detailed on the documentation page.) | (Not included in minimum suggested guidelines) |
This section details the rules governing the use of the visual art, name, and personality of the primary ALIZARIN character.
| Rule Category | ALIZARIN Character Requirements | ALIZARIN Minimum Suggested Guidelines |
|---|---|---|
| Commercial Character IP or Visual Art Use | Requires permission for attachment to a product, API, software, or similar product. (For inquiries, see the special requests form). | Prohibited for commercial use without permission from the IP holder. |
| Name/Identity Use | Requires permission. (Cannot be used as a primary brand for a product/API, for inquiries see the special requests form.) | Prohibited for use as the brand/name of a product, API, or software. |
| Hate Speech/Slander | Strictly Prohibited and enforced. | Strictly Prohibited. |
| Religious/Political Depictions | Prohibited (see documentation for details). | (Not included in minimum suggested guidelines) |
| R-18/Adult Content | Prohibited (unless explicitly permitted, see special requests form). | (Decide this at your own discretion) |
| Derivative Art/Fan Work | Allowed and Encouraged (Must adhere to ethical usage policies). | Allowed. |
| Usage Request Form | Required for all special usages (e.g., visual art, format conversion). | (Separate form is not required for community voices) |
This repository will be updated with generalized, open-source scripts (licensed LGPLv3) as development of the main "ALIZARIN" voice is completed.
- 1.1: Create the "Texture Kit" (static, storms, hums).
- 1.2: Design and build the "Formant" synthesizer algorithm in Pyo.
- 1.3: Create the "Hybrid" audio generation script (a minimal mixing sketch appears after this roadmap).
- 1.4: Generate the 30-60 minute hybrid audio training dataset.
- 2.1: Train the MeloTTS (Real-Time TTS) model for the main ALIZARIN voice.
- 2.2: Train the DiffSinger (AI Singing) model for the main ALIZARIN voice.
- 2.3: Generate and configure the OpenUTAU (Concatenative) voicebank for the main ALIZARIN voice.
- 2.4: (Internal) Train an RVC model for personal use.
- 2.5: [Community Base Model Goal] Generate untextured training data for the High-Pitch and Low-Pitch Community Base Models.
- 2.6: [Conditional Goal] Train and Release Open-Source Community Base Models (TTS/DiffSinger) (IF CROWDFUNDING GOAL MET).
- 3.1: Build and test the public-facing commercial TTS API.
- 3.2: Build and test the internal, low-latency streaming API (based on the acrylicc concept; a `.ust`-generation sketch also follows this roadmap).
- 3.3: Create the "live-sing" DiffSinger & UTAU endpoints for on-stream use.
- 4.1: Draft the Voicebank EULA and Character License.
- 4.2: Finalize packaging and distribution plan.
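As referenced in item 1.3 above, the "hybrid" step layers the texture kit under the dry formant output. The sketch below shows one plausible shape for that script using NumPy and the `soundfile` package; the filenames, mix levels, and mono assumption are all illustrative, not the project's actual code.

```python
# Hypothetical Phase 1 hybrid-mixing sketch: layer a texture-kit bed
# (static/storms/hums) under a dry formant-synth phrase. Assumes mono
# files at the same sample rate; all names and gains are illustrative.
import numpy as np
import soundfile as sf

voice, sr = sf.read("formant_line_001.wav")   # dry formant-synth phrase
texture, tsr = sf.read("texture_static.wav")  # one texture-kit layer
assert sr == tsr, "resample the texture to match the voice first"

# Tile or trim the texture bed so it spans the whole phrase.
reps = int(np.ceil(len(voice) / len(texture)))
bed = np.tile(texture, reps)[: len(voice)]

# Keep the texture well below the voice so phonemes stay intelligible,
# then peak-normalize to avoid clipping.
hybrid = 0.9 * voice + 0.15 * bed
hybrid /= max(1.0, np.abs(hybrid).max())

sf.write("hybrid_line_001.wav", hybrid, sr)
```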
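For items 3.2 and 3.3, the internal streaming path ultimately has to hand OpenUTAU something it can render. Below is a hedged sketch of the `.ust`-generating idea: converting a list of syllables into a minimal UST file. The tempo, note length, pitch, and settings keys are illustrative defaults from the classic UST format, not the internal script.

```python
# Hypothetical .ust-generation sketch: one note per syllable, written in
# a minimal classic-UST layout that OpenUTAU can import. Values below
# (tempo, note length, pitch) are illustrative defaults.
def write_ust(syllables, path, tempo=120.0, note_num=60, length=480):
    lines = ["[#VERSION]", "UST Version1.2",
             "[#SETTING]", f"Tempo={tempo}", "Tracks=1", "Mode2=True"]
    for i, lyric in enumerate(syllables):
        lines += [f"[#{i:04d}]", f"Length={length}",
                  f"Lyric={lyric}", f"NoteNum={note_num}"]
    lines.append("[#TRACKEND]")
    # Classic UTAU tools expect Shift-JIS; OpenUTAU also reads UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

write_ust(["a", "li", "za", "rin"], "line.ust")
```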
The following goals represent tangible product releases and services that will be delivered upon successful funding.
Please Note: These goal amounts are early estimates based on current server costs and R&D requirements. They are subject to change if we encounter unforeseen technical difficulties or significant shifts in cloud-computing costs, and we will be 100% transparent with the community about any adjustments. The funding almost entirely covers training costs and allocates very little to development time; this project is a work of passion, and development will still likely run at a heavy loss if we meet two goals or fewer.
| Goal | Description | Cost | Funded |
|---|---|---|---|
| Upgrade 1: Finetune Duo Voice Pack | HARD GOAL (Singing/Data): Fund the complete process of generating, training, and releasing the two pre-trained DiffSinger Community Base Models and their raw, untextured synthetic training data. This includes all required server time and R&D hours to train two models from scratch. | $750 | ❌ |
| Upgrade 2: Finetune Duo TTS Pack | HARD GOAL (Speaking/TTS): Fund the process of generating TTS-specific datasets and fine-tuning the base MeloTTS model with the new speakers. This results in the release of the two pre-trained MeloTTS Community Base Models, making custom TTS creation easier for users. | $900 | ❌ |
| Upgrade 3: Public RVC Sandbox | Release a specialized Python/Colab notebook for easy, localized, fine-tuning of the base models into a real-time RVC voice changer. | $1050 | ❌ |
| Upgrade 4: "ALIZARIN Streamer" TTS App | Develop and release a standalone desktop application (.exe) for streamers. This tool will provide a user-friendly GUI for the TTS engine, connecting to Twitch (for bits/redeems/etc.) and allowing users to plug in the two Community Models or their own custom-trained ALIZARIN voices. | $1300 | ❌ |
These features rely on collaboration and linguistic expertise from the community, as the creator only speaks English.
| Goal | Description |
|---|---|
| Multilingual Phonemizer Integration | Integrate, test, and validate code to support additional languages (e.g., Japanese/Spanish) in the ALIZARIN Engine. Requires collaboration with native speakers for linguistic validation and accent refinement (a short sketch follows this table). |
| Advanced Accent/Pronunciation Support | Research and develop methods to incorporate slight characterization or accent without losing the non-human synthetic core. |
| Non-English Language Pack Integration | Integrate and validate open-source phonemizer tools for non-English languages (requires external linguistic testing and time). |
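As a sketch of what the phonemizer integration mentioned above might look like, the snippet below uses the open-source `phonemizer` package with its espeak-ng backend. Which tool the engine actually adopts per language is still an open design question.

```python
# Hedged sketch: grapheme-to-phoneme conversion with the `phonemizer`
# package (espeak-ng backend). The engine's final per-language tooling
# is undecided; this only illustrates the integration point.
from phonemizer import phonemize

print(phonemize("alizarin sings", language="en-us", backend="espeak", strip=True))
print(phonemize("alizarina canta", language="es", backend="espeak", strip=True))
```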
These are complex, long-term experimental projects that will be prioritized after the main ALIZARIN voice is successfully delivered and its core technology is stable.
| Goal | Description |
|---|---|
| Consenting Voice Actor Module | Develop the workflow and legal framework for co-owning hybrid voices made by mixing synthetic audio with small human voice samples. The goal is to provide slight characterization and clarity in pronunciation or accent as an alternative option while strictly maintaining the non-human sound and personality of the final voice. |
| Hybrid Data Training & Automation Tool | Research and develop a tool to automate the mixing of the current synthetic output (Formant + Textures) with additional, specialized synthetic elements (like Adachi Rei's pure sine/buzz sounds). The goal is to create novel training sets that improve tone and quality by precisely blending these core synthetic sources. |
| Initial Paper & Methodology Release | Publish a detailed paper/blog post explaining the full technical and ethical methodology of the ALIZARIN Engine. This serves as an official output goal to establish academic recognition and attract contributors. |
The ALIZARIN Engine is not a single piece of software, but an integrated pipeline of open-source tools; a minimal sketch of wiring two of them together follows the table.
| Component | Technology | Purpose |
|---|---|---|
| Base Voice Synthesis | Pyo | (LGPLv3) A Python DSP library for building the "formant" synthesizer. |
| Real-Time TTS | MeloTTS | (Apache 2.0) The engine for the real-time, low-latency TTS. |
| AI Singing Voice | DiffSinger (Amphion) | (MIT) The engine for the high-quality, expressive AI singing model. |
| Concatenative Singing | OpenUTAU | (MIT) The editor and platform for the "classic" UTAU voicebank. |
| Concatenative Engine | WavRS | (MIT) The real-time UTAU renderer server for TTS/live-singing. |
| Internal API Blueprint | speech-to-text-to-teto | (No License) Inspiration and blueprint for the internal `.ust`-generating script. |
| API & Deployment | FastAPI | (MIT) The framework for building all public (commercial) and internal (streaming) API endpoints. |
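As a rough illustration of how two of these components connect, the sketch below wraps MeloTTS in a FastAPI endpoint. The route name, speaker key, and single-file output are assumptions for illustration, not the project's actual API design.

```python
# Hypothetical sketch: a minimal FastAPI endpoint wrapping MeloTTS.
# Route, speaker key, and file handling are illustrative assumptions.
from fastapi import FastAPI
from fastapi.responses import FileResponse
from melo.api import TTS

app = FastAPI()
model = TTS(language="EN", device="cpu")      # load the model once at startup
speaker_id = model.hps.data.spk2id["EN-US"]   # one of MeloTTS's stock speakers

@app.get("/tts")
def tts(text: str, speed: float = 1.0):
    out_path = "out.wav"                      # use per-request temp files in practice
    model.tts_to_file(text, speaker_id, out_path, speed=speed)
    return FileResponse(out_path, media_type="audio/wav")
```

Run with `uvicorn app:app` (assuming the file is `app.py`) and query `/tts?text=hello`.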
This project stands on the shoulders of many contributors. We wish to extend our deepest thanks to:
- missile_39 and the RepliVoice Team: For creating the pioneering Adachi Rei voice. Their successful development of a fully synthetic voice and commercialization of the RepliVoice technology inspired the core "no-human-inside" philosophy of this engine. The ALIZARIN Engine is designed to be an original work that builds upon this concept, not a remake or copy of the distinct qualities of the Adachi Rei voice. They are a great inspiration and motivation to me.
- The Developers of DiffSinger: We specifically thank Jinglin Liu (MoonInTheRiver) and the research team for creating DiffSinger, the technology at the heart of our AI singing product.
- Original Author's Repository: MoonInTheRiver/DiffSinger
- Toolkit Implementation: We utilize the DiffSinger implementation provided by the Amphion toolkit.
- acrylicc: For their `speech-to-text-to-teto` project, which provided the conceptual blueprint for the real-time UTAU-TTS/singing API.
- stakira and the OpenUTAU community: For creating `OpenUTAU` and `WavRS`, which remain the heart of the concatenative synthesis world.
- The Developers of Pyo: For creating the powerful DSP tools that make the formant engine possible.
- The Developers of MeloTTS: For providing a high-quality, commercially-permissive TTS model.