Coqui/Mozilla TTS support #217
Update: I have tested Coqui TTS. The voice quality is impressive and quite stable. Using the "glow" model, I can synthesize a 500-character paragraph in about 5.6 seconds. Oddly, it takes about the same time whether I use Torch (cpu) or Torch (cu102) on my GTX 1070. Unfortunately, 5.6s isn't scalable for us. The Python script also adds about 10 seconds of startup time, which could be eliminated by running it as a service, but that would take some work. Finally, the models do not handle acronyms properly. They use the "gruut" phonemizer library, which I guess pronounces the acronyms as words rather than spelling them out. A workaround is to preprocess the text and "spell out" the acronyms, e.g. TTS would become "tee tee ess".
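The preprocessing workaround could look something like this (a minimal sketch; the letter spellings and the "2+ capitals means acronym" heuristic are my assumptions, not something the gruut library provides):

```python
import re

# Hypothetical preprocessing step: spell out all-caps acronyms before
# sending text to the synthesizer, so the phonemizer doesn't try to
# pronounce them as ordinary words.
LETTER_NAMES = {
    "A": "ay", "B": "bee", "C": "see", "D": "dee", "E": "ee",
    "F": "eff", "G": "gee", "H": "aitch", "I": "eye", "J": "jay",
    "K": "kay", "L": "ell", "M": "em", "N": "en", "O": "oh",
    "P": "pee", "Q": "cue", "R": "ar", "S": "ess", "T": "tee",
    "U": "you", "V": "vee", "W": "double-you", "X": "ex",
    "Y": "why", "Z": "zee",
}

def spell_out_acronyms(text: str) -> str:
    # Treat any standalone run of 2+ capital letters as an acronym.
    return re.sub(
        r"\b[A-Z]{2,}\b",
        lambda m: " ".join(LETTER_NAMES[c] for c in m.group(0)),
        text,
    )

print(spell_out_acronyms("TTS is great"))  # tee tee ess is great
```

A real implementation would want an exception list (e.g. "NASA") for acronyms that are conventionally pronounced as words.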
Hi @ken107. I am using Coqui TTS with a Persian/Farsi model, on a hacked-up version of my Riva TTS Proxy server (nothing to do with Riva; it just reuses the transcoding and stays compatible with the read-aloud integration we've added). The model works quite well and outcompetes the SAPI 5 voice I had been using (Dariush Premium by Harpo Software), which throws a COM error on a lot of new slang it was presumably never trained on, whereas the neural-network-based voice pronounces rare words just fine. Interestingly, it rarely pronounces the same word in exactly the same way twice.

I was thinking maybe I should refactor that server project so it isn't Riva-specific and can house Coqui models as well. The current architecture is: a hacked version of the Riva TTS Proxy exposes a Persian voice to ReadAloud and points to another server, https://github.com/kfatehi/persian-tts-server (let's call this the Coqui model worker), which actually runs the Coqui voice. Coqui streams its WAV, which the TTS Proxy transcodes as usual into the format ReadAloud expects.

One idea is to make https://github.com/kfatehi/riva_tts_proxy more generic, so that it just handles network communication with new backends, transcodes their output, manages the resulting voices, and serves them to ReadAloud-like frontend clients in their preferred format. That way, users could deploy model workers (e.g. persian-tts-server, the Riva stack, SAPI 5 exposers, etc.) and enable them in the TTS proxy; ReadAloud could then just enumerate the voices exposed by the proxy. Just an idea, let me know what you think. Right now I have a mess of proofs-of-concept, so I'll wait for feedback to know which direction to clean up and make things more shareable, but that's the idea I have in mind so far.
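To make the generic-proxy idea concrete, here is a minimal sketch of the registration/enumeration layer (entirely hypothetical names; the real riva_tts_proxy does not expose this API):

```python
# Sketch of a backend-agnostic TTS proxy: model workers register
# themselves, and the proxy exposes one flat voice list that a
# client like ReadAloud could enumerate.

class Backend:
    def __init__(self, name, voices):
        self.name = name          # e.g. "persian-tts-server"
        self.voices = voices      # voice ids this worker serves

class TTSProxy:
    def __init__(self):
        self.backends = {}

    def register(self, backend):
        self.backends[backend.name] = backend

    def list_voices(self):
        # What a frontend client would enumerate from the proxy.
        return [
            {"backend": b.name, "voice": v}
            for b in self.backends.values()
            for v in b.voices
        ]

proxy = TTSProxy()
proxy.register(Backend("persian-tts-server", ["persian-tts-female-vits"]))
proxy.register(Backend("riva", ["English-US-Female-1"]))
print(proxy.list_voices())
```

The synthesis path would then route a (backend, voice) pair to the right worker and pass its WAV stream through the shared transcoding step.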
I took a look at https://github.com/RHVoice/RHVoice. It looks like the voices are exposed via SAPI 5, which I wrote a backend for in the proxy (to access that Dariush voice). I initially found the pyttsx3 library, but it does not implement in-memory buffering/streaming of the WAV. I don't know why RHVoice does not talk about a Mac OS standard voice, but it does describe a Linux one, so I could explore adding that too. Doesn't ReadAloud already utilize SAPI 5? How else would it have access to the offline Microsoft voices? If so, RHVoice should work out of the box. I suspect something is missing here, though, because ReadAloud did not show Dariush (a SAPI 5 voice), which is what prompted me to learn SAPI 5 in detail.
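On the in-memory streaming gap: the general technique, independent of any TTS library, is to assemble the WAV in a memory buffer instead of a temp file, so the proxy can transcode and stream it directly. A standard-library-only illustration (the sine tone is just a stand-in for synthesized audio):

```python
import io
import math
import struct
import wave

# Build a short 16-bit mono WAV entirely in memory; this is the kind
# of buffer a proxy could transcode and stream without touching disk.
def sine_wav_bytes(freq_hz=440, seconds=0.1, rate=16000):
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)        # 16-bit PCM
        w.setframerate(rate)
        n = int(seconds * rate)
        frames = b"".join(
            struct.pack(
                "<h",
                int(32767 * 0.3 * math.sin(2 * math.pi * freq_hz * i / rate)),
            )
            for i in range(n)
        )
        w.writeframes(frames)
    return buf.getvalue()

data = sine_wav_bytes()
print(data[:4])  # b'RIFF'
```

Any library that insists on writing to a file path forces an extra disk round-trip per request, which matters once you care about per-request latency.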
I noticed that Nvidia Riva also has this problem. I haven't tried to address it, but I like your idea. I don't think it works in all situations, but it covers enough of them to be worth implementing.
Here is some performance information with the Persian model I am using (https://github.com/karim23657/Persian-tts-coqui persian-tts-female-vits). Note that I am using an Nvidia 4090. The model takes about 6 seconds to load on initial startup before it's ready to handle synthesis requests.
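For anyone reproducing these numbers, a simple timing harness is enough to separate one-time model load from per-request synthesis latency. A sketch (here `load_model` and `synthesize` are placeholders for the real Coqui TTS calls):

```python
import time

# Time a callable and return (result, elapsed_seconds); used to
# measure one-time model load separately from per-request synthesis.
def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def load_model():
    return object()  # stand-in for loading the VITS model

def synthesize(model, text):
    return b"\x00" * 100  # stand-in for generated audio

model, load_s = timed(load_model)
audio, synth_s = timed(synthesize, model, "sample text")
print(f"load: {load_s:.3f}s, synth: {synth_s:.3f}s")
```

Reporting both numbers matters because a long load time (the ~6 seconds above) is amortized when the model runs as a persistent service.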
It would be great if, for custom voices, one could use a Coqui TTS server to serve natural-sounding voices either locally or remotely.