
[BUG] Sentence Splitting with API #436

Closed
GoudaCouda opened this issue Dec 3, 2024 · 5 comments

Comments

@GoudaCouda

There seems to be an issue with the API when using sentence splitting. Open Web UI has an option to split the text by punctuation or by paragraph and send the pieces separately. When this is active, the AllTalk API sends audio back very slowly; it doesn't seem able to take multiple strings at a time. It reads the first split, then waits, and I hear the next one about 5 seconds later.

This does not seem to be an issue with DeepSpeed, as I tested with and without it. This function also tends to make the audio glitch out; I just hear glitchy sounds.

@erew123
Owner

erew123 commented Dec 4, 2024

@GoudaCouda

Which underlying TTS engine are you experiencing the problems with (F5-TTS, Piper, XTTS, etc.)?
What format are you asking for the TTS to be generated in (MP3, WAV, etc.)?
Any idea how much text you are sending over (in characters)?

Someone reported a similar issue in #410.

Thanks

@GoudaCouda
Author

I was using WAV with a finetuned XTTS model. It didn't really seem to matter how much text I sent over; it was doing it no matter what I was sending or receiving. I think it happened with the base XTTS model as well, but I'd have to double-check that.

@erew123
Owner

erew123 commented Dec 4, 2024

@GoudaCouda I've just checked this by sending the following block of text to the OpenAI endpoint:

When the doctors asked him if he wanted to hop into a frying pan, Alexander of Severe Level III, long acquainted with doctor tongue, knew they meant they wanted to trial a new drug or a new treatment. Alexander said okay because he was tired of being shut up all day in the institution, and wanted to travel to the testing lab. More importantly, he had lost a central piece of himself to shrapnel when he stepped on a landmine eleven years earlier, and continued to perplex what was left of his cerebral cortex with the hope of its recovery.

The doctors assured Alexander that the new treatment, which had to be administered far from the city to avoid electrical effects, had worked well on animal subjects, and if Alexander of Severe Level III would scrawl his agreement on the bottom of a standard form, he could find his way home again.

Home seemed a dream, but a pleasant one.

Once, when he was a small child at the county fair, Alexander, then of Innocence, had far wandered from the hand of his mother who was busy with cotton candy or ice cream. He had been blinded by the midway bulbs, deafened by the shouts of shills, and had begged a tall cowboy for help. Home at that time was not only a pleasant dream, but the possibility of its loss a nightmare.

Now, absent an essential piece of himself, and with the prospect of finding it again without the need to return to the war jungle of snakes and booby traps, Alexander scrawled his agreement, and climbed into an institution van with a doctor, and a driver who somehow resembled a beautiful cow. They both wore important badges and faces.

You are welcome to test this yourself. Inside the \alltalk_tts\system folder is openaittstest.html; you can just click on that and run it.
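You can also run the same single-request check from a script. Below is a minimal sketch of posting to an OpenAI-compatible /v1/audio/speech endpoint; the host, port (7851), model, and voice values are assumptions, so adjust them to match your own AllTalk settings.

```python
# Minimal sketch of a single-request call to an OpenAI-compatible
# /v1/audio/speech endpoint. The host, port, model, and voice values
# below are assumptions -- check your AllTalk settings for the real ones.
import json
import urllib.request

def build_tts_payload(text, voice="female_01.wav", response_format="wav"):
    """One payload carries the whole paragraph, so one generation happens."""
    return {
        "model": "xtts",            # hypothetical model name
        "input": text,              # all the text, unsplit, in one request
        "voice": voice,
        "response_format": response_format,
    }

def request_tts(text, url="http://127.0.0.1:7851/v1/audio/speech"):
    """POST the payload and return raw audio bytes (needs a running server)."""
    body = json.dumps(build_tts_payload(text)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Usage (against a running AllTalk instance):
#   audio = request_tts("This is sentence 1. This is sentence 2.")
```

Because the whole paragraph travels in one `input` field, the server sees exactly one generation request and returns one continuous audio file.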

[Screenshot: openaittstest.html test output]

Your issue is not AllTalk or the OpenAI endpoint.

If you send over the example block of text above, it goes to the OpenAI endpoint as a single TTS generation request.

If you use some software, in your case Open Web UI, to pre-split that into multiple TTS generation requests, then instead of one TTS generation it will of course send several, and playback will of course be choppy in between generation requests. This comes down to how you are sending the TTS requests, not how the TTS is being generated. Here is as simple an explanation as I can give you:

This is sentence 1. This is sentence 2. This is sentence 3. This is the end of this paragraph > TTS Generation > Play audio

If you send that as one TTS generation request, it will take, let's say, 5 seconds, and you get one audio file back to play.

If, however, you pre-split it as you are doing, it gets sent like this:

This is sentence 1. > TTS Generation > Play audio
This is sentence 2. > TTS Generation > Play audio
This is sentence 3. > TTS Generation > Play audio
This is the end of this paragraph > TTS Generation > Play audio

@erew123
Owner

erew123 commented Dec 4, 2024

Sorry, hit send before finishing that.

This is sentence 1. > TTS Generation (2 seconds) > Play audio (2 seconds)
This is sentence 2. > TTS Generation (2 seconds) > Play audio (2 seconds)
This is sentence 3. > TTS Generation (2 seconds) > Play audio (2 seconds)
This is the end of this paragraph > TTS Generation (2 seconds) > Play audio (2 seconds)
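Using the illustrative 2-second figures above, the difference in listener-perceived silence is easy to model. This is only back-of-the-envelope arithmetic, not a measurement:

```python
# Toy timing model of the two strategies described above. The 2-second
# generation/playback figures are the illustrative numbers from the
# comment, not real benchmarks.

def presplit_total(n_chunks, gen_s=2.0, play_s=2.0):
    """Pre-split: each chunk waits for its own generation before playing."""
    return n_chunks * (gen_s + play_s)

def presplit_gap_time(n_chunks, gen_s=2.0):
    """Silence the listener hears: one generation wait before every chunk."""
    return n_chunks * gen_s

def single_request_total(gen_s, play_s):
    """One request: a single generation wait, then uninterrupted playback."""
    return gen_s + play_s

# Four pre-split sentences: 16 s total, 8 s of it silent gaps.
assert presplit_total(4) == 16.0
assert presplit_gap_time(4) == 8.0
# One combined request with, say, 5 s generation and 8 s playback:
# 13 s total, and only the initial 5 s wait is silent.
assert single_request_total(5.0, 8.0) == 13.0
```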

So by pre-splitting it, you are breaking it into multiple TTS requests and multiple playback requests, rather than one flowing request.

This is not a fault of the OpenAI endpoint or the TTS generation, but rather of how you are sending over the request.

I suggest you either send it all as one request, OR ask the people at "Open Web UI" to look into a different cache-management behaviour for how Open Web UI handles sending/generating multiple requests and buffering multiple requests in their software.
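For what that client-side buffering could look like: generate the next chunk while the current one is playing, so the silent gaps shrink toward zero. The sketch below is a generic producer/consumer pattern, not Open Web UI or AllTalk code; generate() and play() are stand-ins using time.sleep.

```python
# Producer/consumer sketch of client-side buffering: the producer thread
# fires TTS generation requests ahead of playback, and the consumer plays
# finished chunks in order from a queue. generate() and play() are fake
# stand-ins (time.sleep) for real TTS calls and audio output.
import queue
import threading
import time

def generate(chunk):
    time.sleep(0.05)                    # stand-in for a TTS generation request
    return f"audio:{chunk}"

def play(audio, played):
    time.sleep(0.05)                    # stand-in for audio playback
    played.append(audio)

def pipelined_playback(chunks):
    """Generate ahead on a background thread; play in order from a buffer."""
    buf = queue.Queue()
    played = []

    def producer():
        for chunk in chunks:
            buf.put(generate(chunk))    # generation overlaps playback
        buf.put(None)                   # sentinel: no more audio coming

    t = threading.Thread(target=producer)
    t.start()
    while (audio := buf.get()) is not None:
        play(audio, played)
    t.join()
    return played

sentences = ["This is sentence 1.", "This is sentence 2.", "This is sentence 3."]
print(pipelined_playback(sentences))
```

After the first chunk's generation wait, each later chunk is usually already buffered by the time the previous one finishes playing, so playback order is preserved but the in-between gaps mostly disappear.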

Hope that helps

Thanks

@erew123 erew123 closed this as completed Dec 4, 2024
@erew123
Owner

erew123 commented Dec 4, 2024

Just to be 100% clear, AllTalk has no concept of what the software making a TTS generation request is doing. It just generates the TTS it is asked to generate and sends it back. So if you send multiple TTS generation requests, AllTalk will generate multiple WAV files and send them back as quickly as it can. AllTalk doesn't know your software sliced up a paragraph; it just returns the generated audio for each request that was made. In your case, with pre-split sentences, that means Open Web UI is sending multiple TTS generation requests and getting back multiple TTS generation responses.
