alltalk v2 api: allow splitting into manageable chunks for the tts engine? #410
Hi @danielw97, what size/amount of text are you talking about? I do already have code within AllTalk for splitting up and merging, though the OpenAI endpoint doesn't pass through it, and you are correct that (to my knowledge) F5-TTS does split up its text and compile it back (supposedly). Let me know the sort of text length you think it has issues at! Thanks
@erew123 That makes sense.
@danielw97 Yeah, 4096 is a limit of the OpenAI V1 API speech endpoint. I could allow more, but then it wouldn't be OpenAI standard. There's something somewhere on their platform about it, but I can't remember where, so here's my reference: https://community.openai.com/t/tts-with-more-than-4096-characters/591842

Setting up a chunking system would be possible without too much issue. I've re-written the whole inside of AllTalk and launched it today (just working my way through a couple of issues and trying to get Google Colab sorted/working at the moment). I can add chunking to the TODO list; it probably wouldn't take too long, but I'm cracking through this other stuff right at this second and have a decent list I want to get done before I work on too much more: Google Colab, Docker checking, then I've part-written a reverse proxy for people who want HTTPS, then an all-new atsetup with extra error output, and so on, hah!

Ohhhh, and I've 70% written the guide on adding new TTS engines. Still have that to do, BUT, what I did do amongst getting this release ready was re-write one of the entire TTS engine files (for want of a better term) so that I could make a template from it: https://github.com/erew123/alltalk_tts/blob/alltalkbeta/system/tts_engines/xtts/model_engine.py

So the whole code is cleaner and there is a template in the template folder built off that. I have to tidy up two or three other files and finish the guide, and all should be good to go! P.S. That model_engine isn't as scary as it looks; it's actually pretty simple now, to be honest, and most of the code you don't need to touch. So I should have that done one day soon.

So, re this: are you OK if I add it to the feature request list?
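(For anyone who needs this in the meantime, here's a rough sketch of a client-side workaround for the 4096-character limit. This isn't AllTalk code; the URL, port and voice name are assumptions you'd adapt to your own install.)

```python
# Rough client-side workaround sketch (not AllTalk code): split the text at
# word boundaries into <=4096-character pieces, send each piece to the
# OpenAI-compatible speech endpoint, and save each audio part to disk.
import textwrap
import requests

ALLTALK_URL = "http://127.0.0.1:7851/v1/audio/speech"  # assumed host/port

def speak_long_text(text, voice="female_01.wav"):  # voice name is an assumption
    pieces = textwrap.wrap(text, width=4096,
                           break_long_words=False, break_on_hyphens=False)
    paths = []
    for i, piece in enumerate(pieces):
        resp = requests.post(ALLTALK_URL, json={
            "model": "tts-1",   # required by the OpenAI schema; server may ignore it
            "input": piece,
            "voice": voice,
            "response_format": "wav",
        }, timeout=300)
        resp.raise_for_status()
        path = f"tts_part_{i:03d}.wav"
        with open(path, "wb") as f:
            f.write(resp.content)
        paths.append(path)
    return paths
```

The parts are saved separately on purpose; join them afterwards with something like ffmpeg's concat demuxer rather than byte-concatenating WAV files, which would leave stray headers in the middle of the stream.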
No problem at all adding this to the feature request list. Personally I'm not in a big rush; it was just something I spotted and wanted to bring up so you're aware.
Added it to the list; will look at it at some point.
@danielw97 FYI, I think... think... this is resolved. I suspect it was an FFmpeg issue with transcoding. I've only just returned from family duties and had time to test how this impacted the OpenAI endpoint; commit here: aa794e4. Long story short, I tried generating with bang on 4096 characters and, of course, the F5-TTS engine (no changes made to that in AllTalk's code). It generated fine with no issues: no drops, stutters, jitters, quirks or oddities. As you can see from the image below, 4096 on the dot, and all was good when listening back. Obviously this requires a git pull and then updating the requirements.
Is your feature request related to a problem? Please describe.
Currently, it appears as though the OpenAI-compatible API sends all of the text to the speech engine, regardless of the limitations of said engine.
For F5-TTS as an example, I believe the inference app splits the text into sentence-based chunks of roughly 135 characters each. When long texts are passed through without this splitting, there are unexpected results.
Describe the solution you'd like
This may be outside the scope of your project, but in an ideal world it would be great if the API could pass text to the speech engine in manageable batches, similar to how the TTS generator does it.
Edit: here's the code from f5tts to show you what I mean; this is used for both the web and CLI inference scripts:
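A minimal sketch of the same sentence-chunking idea (illustrative names only, not the actual F5-TTS source):

```python
# Minimal sketch, not the actual F5-TTS source: split on sentence-ending
# punctuation, then greedily pack sentences into chunks of about max_chars.
# A single sentence longer than max_chars still becomes its own oversized
# chunk here; the real implementation handles more punctuation and edge cases.
import re

def chunk_by_sentence(text, max_chars=135):
    sentences = re.split(r"(?<=[.!?;:])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk then gets synthesised separately and the audio stitched back together, which is roughly what the TTS generator does for long texts.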
Describe alternatives you've considered
As I know your time to work on this project is limited, I completely understand if you'd prefer to shelve something like this for the moment, but I wanted to suggest it as it's something I've run into myself.
Additional context
n/a