
Improvements to ACE-Step 1.5 text encoding (part 2) #12350

Merged
comfyanonymous merged 5 commits into Comfy-Org:master from blepping:improve_ace15_te_pt2
Feb 10, 2026

Conversation

@blepping (Contributor) commented Feb 7, 2026

This pull attempts (hopefully, successfully) to:

  • Skip calculating the uncond batch when CFG is set to 1. Running the LM without CFG doesn't seem to help much with performance and the results are pretty bad, so I'm not sure exactly when someone would want to do this, but it makes sense not to calculate a batch that just isn't used. It may also reduce memory consumption for memory-limited users.
  • Add backend support for more flexible encoding. Third-party nodes can now do things like set a negative prompt, set min/max LM code limits (forcing the LM to generate exactly a certain number seems to cause problems fairly often in my testing), use time or key signatures the built-in node doesn't allow, etc.
  • Clean up the code a bit and make it easier to read and modify.
  • Add console progress output via tqdm, like other sampling functions.
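The CFG-skip behavior in the first item can be sketched like this (a minimal illustration with made-up names such as `sample_lm`, not the actual ComfyUI code):

```python
def sample_lm(model, cond_tokens, uncond_tokens, cfg=1.0):
    """Sketch of classifier-free guidance for LM logits.

    `model` maps a token sequence to logits. When cfg == 1.0, the uncond
    term cancels out of the mix below, so the uncond batch can be skipped
    entirely, saving both compute and memory.
    """
    cond_logits = model(cond_tokens)
    if cfg == 1.0:
        # Nothing to mix; don't run (or even build) the uncond batch.
        return cond_logits
    uncond_logits = model(uncond_tokens)
    # Standard CFG mix: uncond + (cond - uncond) * cfg.
    return [u + (c - u) * cfg for c, u in zip(cond_logits, uncond_logits)]
```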

I have some ACE related nodes here which have initial support for this: https://gist.github.com/blepping/d0f6a26b1f59ed705999945821a3ee8a

For example, with the backend support you can now do something like this:

[image]

The compress node strips off duplicate codes (getting 20 or so of the same code at the end is pretty common; it's usually what I assume is the code for silence). Since the number of codes is dynamic when using a flexible min/max token limit (or when pruning codes), the latent size has to be calculated from the number of codes.
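A sketch of what such a compress step could look like (this is an illustration, not the actual node's code; the 5-codes-per-second rate is an assumption inferred from the `duration * 5` min_tokens expression in the diff):

```python
def compress_trailing_codes(codes):
    """Collapse the run of identical codes at the end down to a single code.

    A long tail of one repeated code (presumably silence) is common in the
    LM output, so only the trailing run is deduplicated.
    """
    end = len(codes)
    while end > 1 and codes[end - 1] == codes[end - 2]:
        end -= 1
    return codes[:end]

def codes_to_duration(codes, codes_per_second=5):
    # Assumed rate: `duration * 5` appears as min_tokens in the diff, so
    # roughly 5 codes per second; the latent size is derived from this.
    return len(codes) / codes_per_second
```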

I was hesitant to mess with ComfyUI's built-in text encoding node because adding or reordering fields breaks workflows: ComfyUI uses order/slot-based associations rather than keys when loading a workflow (unless something has changed recently). At least with this backend support, custom node authors have a way to add extended functionality without having to completely reimplement text encoding.

I've done a number of generations using these features and they don't seem to cause any issues, but I can't say it's extensively tested, and I'm not sure I fully understood how the sampling function works. (Since ComfyUI requires transformers, it would probably make sense to use the logits-processing functions from Transformers, like top-p. Sampling seems like it could really benefit from extended features like the repetition or n-gram penalties Transformers provides.)

The te_debug thing is a bit hacky and can be removed (or changed) if necessary. It is really useful to be able to see exactly what is getting encoded, though, and there isn't really another way to do that without changing ComfyUI's TE source files. edit: I just removed it.

@blepping force-pushed the improve_ace15_te_pt2 branch from 8282a28 to 7f1094c on February 7, 2026 18:46
Diff context (truncated):

    out["qwen3_06b"] = self.qwen3_06b.tokenize_with_weights("# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n# Caption\n{}\n# Metas\n{}\n<|endoftext|>\n<|endoftext|>".format(text, meta_cap), return_word_ids, **kwargs)
    out["lm_metadata"] = {"min_tokens": duration * 5,
    out = {
        prompt_key: self.qwen3_06b.tokenize_with_weights(prompt, prompt_key == "qwen3_06b" and return_word_ids, **kwargs)
Member:
You need to put: disable_weights=True here or else it's going to strip out the () for token weights.

@blepping (Contributor, Author) replied Feb 8, 2026:

Ah, sorry, that was a dumb mistake. The original code actually didn't have that enabled when encoding the qwen3_06b key. Was that correct behavior that should be preserved, or do we want to always set disable_weights=True? edit: Assuming the original implementation was correct, this now replicates that behavior and does not set disable_weights when encoding qwen3_06b.

Member:

did you test if the weights actually work? if they don't it's better to just disable them on all the encoders.

@blepping (Contributor, Author):

No, I didn't try it, because that generally doesn't work for LLMs. I think it makes sense to just disable them globally, since there is no practical way to pass separate prompts for the different parts: if one encoder used prompt weighting and another used the literal characters, one of them would be parsed incorrectly.
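The conflict described above can be illustrated with a minimal parser for a "(phrase:1.2)" weight syntax in the style of ComfyUI's convention. This is a hypothetical simplified stand-in, not ComfyUI's actual tokenizer code:

```python
import re

def parse_weighted(text):
    """Split "(phrase:1.2)" spans into (text, weight) pairs.

    Text outside weighted spans gets weight 1.0. A simplified sketch: real
    weight parsers also handle escaping and bare "(text)" spans, which is
    exactly where literal-text prompts (e.g. lyrics containing "(chorus)")
    and weighted prompts collide.
    """
    pairs = []
    pos = 0
    for m in re.finditer(r"\(([^():]+):([0-9.]+)\)", text):
        if m.start() > pos:
            pairs.append((text[pos:m.start()], 1.0))
        pairs.append((m.group(1), float(m.group(2))))
        pos = m.end()
    if pos < len(text):
        pairs.append((text[pos:], 1.0))
    return pairs
```

With one shared prompt feeding multiple encoders, there is no way to apply this parsing for one encoder while preserving the literal parentheses for another, which is the argument for disabling weights everywhere.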

The last commit has some additional changes to try to get TE encoding as close to the official implementation as possible. It seems like I was wrong to add the caption to the CoT even though the code apparently does add it. With these changes, comparing the official Gradio app (on left) to ComfyUI for a dumb LLM generated sample prompt:

LM: [image]

DIT: [image]

I still don't believe in the double EoT tokens, but the official app didn't dump the lyrics encoding or the negative LM prompt (if it even encoded one), and I don't have time to add debug code for that, so I didn't mess with it.

I forced the language to not get passed using my node; unfortunately, there isn't a way to avoid passing the language with ComfyUI's current conditioning encode node, so there would be something like language: en in the metas.

This is as close as I can get it to the official app for now. I'm also trusting that the debug output isn't lying to me. There is some weird stuff in the official app, like https://github.com/ace-step/ACE-Step-1.5/blob/08f56edd7fc68452608933706c59da5102bc602e/acestep/handler.py#L2377, where it appears to add the lyrics text (with a # Languages section) to a text prompt after it has already dumped the debug info for the prompt. I don't know if that's just informational or if it actually gets encoded and used.

@blepping (Contributor, Author):

did you test if the weights actually work?

I tested this. Even if the weights are set in the token/weight pairs, encode_from_tokens_scheduled doesn't seem to respect them: I verified that a token had an absurd weight like 2000 in the pairs, and it didn't actually affect anything.

Reorder meta caps to match official implementation.
Remove caption from CoT.
Ensure sections are double spaced for the LM prompt.
Also strip the /4 from timesignature in meta caps.
@comfyanonymous comfyanonymous merged commit baf8c87 into Comfy-Org:master Feb 10, 2026
12 checks passed
@comfyanonymous (Member) commented:
I double-checked it, and with #12376 it matches the reference implementation completely.

luna-niemitalo pushed a commit to luna-niemitalo/ComfyUI that referenced this pull request Feb 11, 2026