
alltalk v2 api: allow splitting into manageable chunks for the tts engine? #410

Closed
danielw97 opened this issue Nov 18, 2024 · 6 comments

danielw97 commented Nov 18, 2024

Is your feature request related to a problem? Please describe.
Currently, it appears that the OpenAI-compatible API sends all of the text to the speech engine, regardless of that engine's limitations.
Taking F5-TTS as an example, I believe its inference app splits the text into sentence-based chunks of roughly 135 characters each.
There are unexpected results when long texts are passed through unsplit.
Describe the solution you'd like
This may be outside the scope of your project, but in an ideal world it would be great if the API could pass text to the speech engine in manageable batches, similar to how the TTS generator does it.
Edit: here's the code from F5-TTS to show what I mean; it's used by both the web and CLI inference scripts:

import re  # needed by the sentence splitting below


# chunk text into smaller pieces
def chunk_text(text, max_chars=135):
    """
    Splits the input text into chunks, each with a maximum number of characters.

    Args:
        text (str): The text to be split.
        max_chars (int): The maximum number of characters per chunk.

    Returns:
        List[str]: A list of text chunks.
    """
    chunks = []
    current_chunk = ""
    # Split the text into sentences based on punctuation followed by whitespace
    sentences = re.split(r"(?<=[;:,.!?])\s+|(?<=[;:,。!?])", text)
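    # Note: lengths below are compared in UTF-8 bytes, so multi-byte characters
    # (e.g. CJK) count toward max_chars more heavily than ASCII ones.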

    for sentence in sentences:
        if len(current_chunk.encode("utf-8")) + len(sentence.encode("utf-8")) <= max_chars:
            current_chunk += sentence + " " if sentence and len(sentence[-1].encode("utf-8")) == 1 else sentence
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " " if sentence and len(sentence[-1].encode("utf-8")) == 1 else sentence

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks

Describe alternatives you've considered
As I know your time to work on this project is limited, I completely understand if you'd prefer to shelve something like this for the moment, but I wanted to suggest it as it's something I've run into myself.
Additional context
n/a

erew123 (Owner) commented Nov 24, 2024

Hi @danielw97. What size/amount of text are you talking about? I do already have code within AllTalk for splitting up and merging, though the OpenAI endpoint doesn't pass through it, and you are correct that (to my knowledge) F5-TTS splits up its text and compiles it back together (supposedly).

LMK the sort of text length where you think it runs into issues!

Thanks

danielw97 (Author) commented

@erew123 that makes sense.
Basically, one of the apps I use, https://github.com/p0n1/epub_to_audiobook, can send text to OpenAI, and I'm able to hook into that locally by setting my OPENAI_BASE_URL to my AllTalk address.
It's working great, apart from the fact that it sends 4096 characters by default, which produces strange results with F5-TTS.
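
For illustration, here is a minimal sketch of the same kind of hookup using the OpenAI Python client directly; the base URL, model, voice and output filename are placeholders for whatever your own AllTalk instance actually exposes, not values taken from this thread:

from openai import OpenAI

# Point the standard OpenAI client at a local AllTalk instance instead of
# api.openai.com (equivalent to setting OPENAI_BASE_URL for an app like
# epub_to_audiobook). All values below are placeholders.
client = OpenAI(
    base_url="http://localhost:7851/v1",   # hypothetical local AllTalk address
    api_key="not-needed-locally",          # a dummy key keeps the client happy
)

response = client.audio.speech.create(
    model="tts-1",              # placeholder model name
    voice="female_01",          # placeholder voice name
    response_format="wav",      # standard OpenAI parameter; local support may vary
    input="Hello from a locally hosted TTS engine.",
)

with open("output.wav", "wb") as f:
    f.write(response.content)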
I believe SillyTavern can also run into this problem if it sends text to the API, although I've not tested that yet.
In an ideal world it would be great at some point to have a queuing mechanism set up for the API, so it only sends a specified length of text to the TTS engines at a time; this could maybe be user-configurable?
I've been looking at the F5-TTS architecture, and it only supports a combined reference/output generation of 30 seconds, which is why their apps limit chunks to 135 characters.
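
To sketch the idea (a rough illustration only, not AllTalk code: generate_audio is a hypothetical stand-in for whatever call the engine actually exposes), the API could reuse a splitter like chunk_text above, generate each chunk in turn, and concatenate the audio before returning it:

import numpy as np

def synthesize_long_text(text, generate_audio, max_chars=135):
    """Feed the TTS engine one manageable chunk at a time and
    concatenate the results into a single waveform.

    generate_audio(chunk) -> (sample_rate, 1-D numpy array) is a
    hypothetical stand-in for the real engine call.
    """
    sample_rate = None
    segments = []
    for chunk in chunk_text(text, max_chars=max_chars):  # chunk_text as quoted above
        sr, audio = generate_audio(chunk)
        sample_rate = sample_rate or sr
        segments.append(audio)
    return sample_rate, np.concatenate(segments)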
Also, I'm loving the work on this project and am going to look at integrating StyleTTS when you have finished your refactoring.

erew123 (Owner) commented Nov 25, 2024

@danielw97 Yeah, 4096 is a limit of the OpenAI v1 API speech endpoint... I could allow more, but it wouldn't be OpenAI standard then. There's something somewhere on their platform about it, but I can't remember where, so here's my reference: https://community.openai.com/t/tts-with-more-than-4096-characters/591842

Setting up a chunking system would be possible without too much issue. I've rewritten the whole inside of AllTalk and launched it today (just working my way through a couple of issues and trying to get Google Colab sorted/working at the moment).

I can add chunking to the list of TODOs; it probably wouldn't take too long, but I'm just cracking through this other stuff right at this second and have a decent list I want to get done before I work on too much more: Google Colab, Docker checking, then I've part-written a reverse proxy for people who want HTTPS, so I've got that, then all-new atsetup scripts with extra error output, etc... so on/so forth hah!!

Ohhhh, and I've 70% written the guide on adding new TTS engines. Still have that to do, BUT, what I did do amongst all of getting this release ready was rewrite one of the entire TTS engine files (for want of a better term) so that I could make a template from it: https://github.com/erew123/alltalk_tts/blob/alltalkbeta/system/tts_engines/xtts/model_engine.py

So the whole code is cleaner and there is a template in the template folder built off that. I have to tidy up 2-3 other files and finish the guide, and all should be good to go! P.S. that model_engine isn't as scary as it looks... it's actually pretty damn simple now to be honest, and most of the code you don't need to touch. So I will have that done one day soon.

So, re this: are you ok if I add it to the feature requests?

danielw97 (Author) commented

No problem at all adding this to the feature request list. Personally I'm not in a big rush, but it was something I spotted, so I wanted to bring it up so you're aware.

erew123 (Owner) commented Nov 25, 2024

Added it to the list; will look at it at some point.

erew123 (Owner) commented Dec 12, 2024

@danielw97 FYI, I think ... think this is resolved. I suspect it was an FFmpeg issue with transcoding. I've only just returned from family duties and had time to test how this impacted the OpenAI endpoint; commit here: aa794e4

Long story short, I tried generating with bang on 4096 characters using the F5-TTS engine (no changes made to that in AllTalk's code). It generated fine with no issues: no drops, stutters, jitters, quirks or oddities. As you can see from the image below, 4096 on the dot, and all was good when listening back.

Obviously this requires a git pull and then updating the requirements.

[image: screenshot showing the 4096-character generation]
