Improvements to ACE-Step 1.5 text encoding (part 2) #12350

comfyanonymous merged 5 commits into Comfy-Org:master from
Conversation
Force-pushed from 8282a28 to 7f1094c
comfy/text_encoders/ace15.py
Outdated
```python
out["qwen3_06b"] = self.qwen3_06b.tokenize_with_weights("# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n# Caption\n{}\n# Metas\n{}\n<|endoftext|>\n<|endoftext|>".format(text, meta_cap), return_word_ids, **kwargs)
out["lm_metadata"] = {"min_tokens": duration * 5,
out = {
    prompt_key: self.qwen3_06b.tokenize_with_weights(prompt, prompt_key == "qwen3_06b" and return_word_ids, **kwargs)
```
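The refactored tokenize step can be sketched roughly like this (a hypothetical standalone sketch, with names borrowed from the diff; `tokenize_sections` and its arguments are illustrative, not ComfyUI's actual API). Each prompt section goes through the same tokenizer, word IDs are only requested for the primary `qwen3_06b` key, and prompt weighting is disabled so literal `()` characters survive, per the review discussion below:

```python
# Hypothetical sketch of the refactored tokenize loop (names assumed from the
# diff above, not ComfyUI's real implementation). Every prompt section is
# tokenized with the same encoder; word IDs are only returned for the primary
# "qwen3_06b" key, and weight parsing is disabled so literal () survive.
def tokenize_sections(tokenize_with_weights, prompts, return_word_ids=False):
    out = {}
    for prompt_key, prompt in prompts.items():
        out[prompt_key] = tokenize_with_weights(
            prompt,
            # only the primary key gets word IDs back
            prompt_key == "qwen3_06b" and return_word_ids,
            disable_weights=True,
        )
    return out
```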
You need to pass disable_weights=True here, or else it's going to strip out the () for token weights.
Ah, sorry, that was a dumb mistake. The original code actually didn't have that enabled when encoding the qwen3_06b key. Was that correct behavior that should be preserved, or do we want to always set disable_weights=True?

Edit: Assuming the original implementation was correct, this now replicates that behavior and does not set disable_weights when encoding qwen3_06b.
Did you test whether the weights actually work? If they don't, it's better to just disable them on all the encoders.
No, I didn't try it, because that generally doesn't work for LLMs. I think it makes sense to just disable them globally, since there is no practical way to pass separate prompts for the different parts. If one of them used prompt weighting and another used the literal characters, you'd be parsing the prompt incorrectly in one of the cases.
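To illustrate the conflict: a minimal, simplified sketch of ()-style prompt weighting (this is not ComfyUI's actual parser, just a toy regex version) shows how `(text:1.2)` is consumed as a weight directive, so a prompt that intends literal parentheses gets mangled unless weight parsing is disabled:

```python
import re

# Toy sketch of ()-style prompt weighting, NOT ComfyUI's real parser:
# "(text:1.2)" becomes ("text", 1.2) and the parentheses are consumed.
WEIGHT_RE = re.compile(r"\(([^():]+):([0-9.]+)\)")

def parse_weights(prompt, disable_weights=False):
    if disable_weights:
        # treat the whole prompt literally with weight 1.0
        return [(prompt, 1.0)]
    parts, pos = [], 0
    for m in WEIGHT_RE.finditer(prompt):
        if m.start() > pos:
            parts.append((prompt[pos:m.start()], 1.0))
        parts.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(prompt):
        parts.append((prompt[pos:], 1.0))
    return parts
```

With weights enabled, `parse_weights("piano (loud:1.3) melody")` drops the parentheses; with `disable_weights=True` the characters pass through untouched, which is why mixing the two behaviors across encoders that share one prompt would mis-parse one of them.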
The last commit has some additional changes to try to get TE encoding as close to the official implementation as possible. It seems I was wrong to add the caption to the CoT, even though the code apparently does add it. With these changes, comparing the official Gradio app (on the left) to ComfyUI for a dumb LLM-generated sample prompt:
(LM prompt comparison screenshot)
(DIT prompt comparison screenshot)
I still don't believe in the double EoT tokens, but the official app didn't dump the lyrics encoding or the negative LM prompt (if it even encoded one, not sure), and I don't have time to add debug code for that, so I didn't mess with it.
I forced the language to not get passed using my node; unfortunately there isn't a way to avoid passing the language with ComfyUI's current conditioning encode node, so there would be something like `language: en` in the metas.
This is as close as I can get it to the official app for now. I'm also trusting that it's not lying to me with the debug output. There is some weird stuff in the official app, like https://github.com/ace-step/ACE-Step-1.5/blob/08f56edd7fc68452608933706c59da5102bc602e/acestep/handler.py#L2377, where it appears to append the lyrics text (with a # Languages section) to a text prompt after it has already dumped the debug info for the prompt. I don't know if that's just informational or if it actually gets encoded and used.
> did you test if the weights actually work?
I tested this. Even if the weights are set in the token/weight pairs, encode_from_tokens_scheduled doesn't seem to respect them. I verified that a token had an absurd weight like 2000 in the token/weight pairs, but it didn't actually affect anything.
Reorder meta caps to match the official implementation. Remove the caption from the CoT. Ensure sections are double-spaced for the LM prompt. Also strip the /4 from timesignature in meta caps.
I double-checked it, and with #12376 it matches the reference implementation completely.
This pull attempts (hopefully successfully) to:
I have some ACE related nodes here which have initial support for this: https://gist.github.com/blepping/d0f6a26b1f59ed705999945821a3ee8a
For example, with the backend support you can now do something like this:
The compress node strips off duplicate codes (getting like 20 or so of the same code at the end is pretty common, usually it's just what I assume is the code for silence). Since the number of codes is dynamic when using a flexible min/max token limit (or possibly pruning codes), it's necessary to calculate the latent size from the number of codes.
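The compress idea could be sketched like this (a hypothetical standalone version, not the actual node code): drop the trailing run of identical codes, then derive the latent length from the remaining code count. The 5-codes-per-second rate is taken from the `min_tokens = duration * 5` line in the diff above; the latent frames-per-second ratio here is a made-up placeholder, not ComfyUI's real value:

```python
# Hypothetical sketch of the "compress" behavior described above.
def strip_trailing_repeats(codes, keep=1):
    """Drop the trailing run of identical codes (often repeated silence),
    keeping at least `keep` copies of the final code."""
    if not codes:
        return codes
    i = len(codes) - 1
    while i > 0 and codes[i - 1] == codes[-1]:
        i -= 1
    return codes[: max(i + keep, 1)]

def latent_frames_from_codes(num_codes, codes_per_second=5, frames_per_second=25):
    """Derive latent length from the (now dynamic) number of codes.
    codes_per_second=5 matches min_tokens = duration * 5 from the diff;
    frames_per_second is an assumed placeholder value."""
    seconds = num_codes / codes_per_second
    return int(seconds * frames_per_second)
```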
I was hesitant to mess with ComfyUI's built-in text encoding node, because adding or reordering fields breaks workflows: ComfyUI uses order/slot-based associations when loading a workflow rather than keys (unless something has changed recently). But at least with this backend support, custom node authors have a way to add extended functionality without having to completely reimplement text encoding.
I've done a number of generations using these features and they don't seem to cause any issues, but I can't say it's extensively tested, and I'm not sure I fully understood how the sampling function works. (Since ComfyUI requires transformers, it would probably make sense to use the logits-processing functions from Transformers, like top-p, etc. Sampling seems like it could really benefit from extended features like repetition or n-gram penalties, which Transformers provides.)
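For reference, the two sampling features mentioned there can be implemented standalone in a few lines (this is a plain-Python sketch of the same ideas; Transformers ships equivalents such as `TopPLogitsWarper` and `RepetitionPenaltyLogitsProcessor`, which would be the better choice in ComfyUI since transformers is already a dependency):

```python
import math
import random

# Standalone sketch of top-p (nucleus) sampling plus a CTRL-style
# repetition penalty, mirroring what transformers' logits processors do.
def apply_repetition_penalty(logits, generated, penalty=1.2):
    out = list(logits)
    for tok in set(generated):
        # penalize already-generated tokens: divide positive logits,
        # multiply negative ones (the CTRL-style formulation)
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

def top_p_sample(logits, p=0.9, rng=random):
    # softmax with max-subtraction for numeric stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # keep the smallest prefix of tokens whose cumulative mass reaches p
    kept, cum = [], 0.0
    for prob, idx in probs:
        kept.append((prob, idx))
        cum += prob
        if cum >= p:
            break
    # renormalize over the nucleus and draw
    norm = sum(pr for pr, _ in kept)
    r = rng.random() * norm
    for prob, idx in kept:
        r -= prob
        if r <= 0:
            return idx
    return kept[-1][1]
```

An n-gram penalty would follow the same shape: scan the generated tail for banned continuations and push those logits to negative infinity before sampling.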
The te_debug thing is a bit hacky and can be removed (or changed) if necessary. It is really useful to be able to see exactly what is getting encoded, though, and there isn't really another way to do that without changing ComfyUI's TE source files.

Edit: I just removed it.