Improvements to ACE-Steps 1.5 text encoding by blepping · Pull Request #12283 · Comfy-Org/ComfyUI

blepping · 2026-02-04T17:28:18Z

ACE-Step 1.5 text encoding is pretty broken right now, unfortunately. I am not positive these changes are perfect or a complete fix. I didn't have much time to work on this, so it is also very lightly tested.

These changes seem to produce much better output. Here's something fun to listen to: https://voca.ro/11TKOC8Jgebf

Given the caption `Blah blah user caption` and the lyrics `[Instrumental]`, this is the debug output from the current implementation:

Raw debug output

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\nbpm: 160\nduration: 175\nkeyscale: D major\ntimesignature: 3\n</think>\n\n<|im_end|>\n'

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n<|im_end|>\n'

TOKENIZING: '# Languages\nzh\n\n# Lyric[Instrumental]<|endoftext|><|endoftext|>'

TOKENIZING: '# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n# Caption\nBlah blah user caption# Metas\n- bpm: 160\n- timesignature: 3\n- keyscale: D major\n- duration: 175\n<|endoftext|>\n<|endoftext|>'

That's pretty hard to read, so printed out as strings:

`lm_prompt`

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>
bpm: 160
duration: 175
keyscale: D major
timesignature: 3
</think>

<|im_end|>

Serious issues here:

There is no # Lyric section; the lyrics are just run together with the caption.
The CoT part is supposed to be YAML encoded with sorted keys and include the caption as well, I think. Both existing implementations are a complex maze that is difficult to follow.

Refs:

`lm_prompt_negative`

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>

</think>

<|im_end|>

`lyrics`

# Languages
zh

# Lyric[Instrumental]<|endoftext|><|endoftext|>

Serious issues here:

The lyrics are run together with the # Lyric section header. No wonder the model has issues with lyric conformance!
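A quick way to catch this kind of run-together header is a small check over the prompt text. This is only a sketch and not part of the pull: `find_run_together_headers` is a hypothetical helper, and the header list is just what shows up in the debug output here.

```python
# Section headers that appear in the debug output of this PR.
HEADERS = ("# Instruction", "# Caption", "# Lyric", "# Languages", "# Metas")


def find_run_together_headers(prompt: str) -> list:
    """Describe every header that shares its line with other text."""
    problems = []
    for line in prompt.split("\n"):
        for header in HEADERS:
            idx = line.find(header)
            if idx == -1:
                continue  # this header is not on this line
            if idx > 0:
                # Something (e.g. the caption) precedes the header.
                problems.append(f"text before {header!r}: {line!r}")
            elif line != header:
                # Something (e.g. the lyrics) follows the header.
                problems.append(f"text after {header!r}: {line!r}")
    return problems


old_lyrics = "# Languages\nzh\n\n# Lyric[Instrumental]<|endoftext|><|endoftext|>"
for problem in find_run_together_headers(old_lyrics):
    print(problem)
```

Running it over each string right before tokenizing would flag both the `# Lyric[Instrumental]` case and a caption glued to `# Metas`.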

`qwen3_06b`

# Instruction
Generate audio semantic tokens based on the given conditions:

# Caption
Blah blah user caption# Metas
- bpm: 160
- timesignature: 3
- keyscale: D major
- duration: 175
<|endoftext|>
<|endoftext|>

Serious issues here:

The caption is run together with the metas section (unless the user has a newline at the end of the caption and the code doesn't strip the whitespace; I didn't check, but most users wouldn't expect that to be necessary).
Duration is supposed to be written like "175 seconds", not just a bare number.

Refs:

https://github.com/ace-step/ACE-Step-1.5/blob/4ab3630fd6c8868dd98bf1a9dde0fe4cc54674f3/acestep/handler.py#L797
https://github.com/ace-step/ACE-Step-1.5/blob/4ab3630fd6c8868dd98bf1a9dde0fe4cc54674f3/acestep/handler.py#L1179 - this may imply it should use the "123 seconds" format everywhere. I am not positive, so I didn't mess with other places.

With this pull, we get the output:

Raw debug output

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n# Lyric\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\nbpm: 160\ncaption: Blah blah user caption\nduration: 175\nkeyscale: D major\nlanguage: zh\ntimesignature: 3\n</think>\n<|im_end|>\n'

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n# Lyric\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\n</think>\n<|im_end|>\n'

TOKENIZING: '# Languages\nzh\n\n# Lyric\n[Instrumental]<|endoftext|><|endoftext|>'

TOKENIZING: '# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n# Caption\nBlah blah user caption\n# Metas\n- bpm: 160\n- duration: 175 seconds\n- keyscale: D major\n- timesignature: 3\n<|endoftext|>\n<|endoftext|>'

`lm_prompt`

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
# Lyric
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>
bpm: 160
caption: Blah blah user caption
duration: 175
keyscale: D major
language: zh
timesignature: 3
</think>
<|im_end|>
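For illustration, here is a minimal sketch of how the layout above could be assembled. It is not the code from this pull: `build_lm_prompt` and its arguments are hypothetical names, stripping the caption and lyrics is my assumption, and the sorted `key: value` lines are only a stand-in for real YAML encoding.

```python
from typing import Optional


def build_lm_prompt(caption: str, lyrics: str, cot: Optional[dict] = None) -> str:
    """Assemble the chat-style LM prompt in the layout shown above."""
    # Sorted "key: value" lines; a stand-in for YAML dumping with sorted keys.
    cot_lines = [f"{key}: {value}" for key, value in sorted((cot or {}).items())]
    think = "\n".join(["<think>", *cot_lines, "</think>"])
    return (
        "<|im_start|>system\n"
        "# Instruction\n"
        "Generate audio semantic tokens based on the given conditions:\n\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"# Caption\n{caption.strip()}\n"
        f"# Lyric\n{lyrics.strip()}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"{think}\n"
        "<|im_end|>\n"
    )


print(build_lm_prompt(
    "Blah blah user caption",
    "[Instrumental]",
    {"bpm": 160, "caption": "Blah blah user caption", "duration": 175,
     "keyscale": "D major", "language": "zh", "timesignature": 3},
))
```

Calling it without `cot` gives the empty `<think>` block used for the negative prompt.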

`lm_prompt_negative`

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
# Lyric
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>
</think>
<|im_end|>

`lyrics`

# Languages
zh

# Lyric
[Instrumental]<|endoftext|><|endoftext|>
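The only difference from the old output is the newline after the `# Lyric` header. A rough sketch of that layout, with `build_lyrics_prompt` as a hypothetical name; it deliberately stops before the trailing `<|endoftext|>` tokens since I am not sure where those should come from.

```python
def build_lyrics_prompt(lyrics: str, language: str) -> str:
    """Lay out the lyrics prompt with the header on its own line."""
    # The "\n" after "# Lyric" is the part the old output was missing.
    return f"# Languages\n{language}\n\n# Lyric\n{lyrics.strip()}"


print(build_lyrics_prompt("[Instrumental]", "zh"))
```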

`qwen3_06b`

# Instruction
Generate audio semantic tokens based on the given conditions:

# Caption
Blah blah user caption
# Metas
- bpm: 160
- duration: 175 seconds
- keyscale: D major
- timesignature: 3
<|endoftext|>
<|endoftext|>
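As a sketch of the layout above (again not the actual code in this pull): `build_sft_prompt` is a hypothetical name, stripping the caption and sorting the meta keys are assumptions based on the output shown, and the trailing `<|endoftext|>` tokens are left out.

```python
def build_sft_prompt(caption: str, metas: dict) -> str:
    """Lay out the instruction, caption and metas sections, one header per line."""
    lines = [
        "# Instruction",
        "Generate audio semantic tokens based on the given conditions:",
        "",
        "# Caption",
        caption.strip(),  # strip so a trailing newline in the caption changes nothing
        "# Metas",
    ]
    for key in sorted(metas):
        value = metas[key]
        if key == "duration":
            # Duration carries its unit instead of being a bare number.
            value = f"{value} seconds"
        lines.append(f"- {key}: {value}")
    return "\n".join(lines) + "\n"


print(build_sft_prompt(
    "Blah blah user caption",
    {"bpm": 160, "timesignature": 3, "keyscale": "D major", "duration": 175},
))
```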

The lack of a blank line between the # Caption and # Metas sections looks a bit weird; however, the official template would have the same result if there wasn't a trailing newline: https://github.com/ace-step/ACE-Step-1.5/blob/eafcc2098696c60fb9e35d91813d84282d78959a/acestep/constants.py#L101

Potential issues this pull doesn't address:

I am suspicious of the double <|endoftext|> tokens at the ends of some of the prompts. Also, one has a newline in between the tokens and one doesn't.
Duration possibly should have the unit "seconds" in all the places it appears. I only fixed the one place I was pretty sure about.
Maybe ensuring a blank line between the # Caption and # Metas sections in the SFT prompt would be a good idea.
I am pretty sure the number of LM codes can be mismatched with the latent size, since the audio codes output isn't going to be exact (as far as I know) but only roughly the specified duration. You're likely to get very suboptimal results with LM codes for, let's say, a 2 minute song when you only provide a 1:45 latent. It would probably make sense to have a node that makes an empty latent based on the conditioning after it was generated, to ensure the sizes are copacetic.
Always forcing a specific time signature, key, language, etc., no matter what, is not how the official model works and severely limits creativity/diversity in results. There is also no way to change stuff like the LM negative prompt, CFG scale, sampling parameters, etc.
The model also doesn't even require the LM codes part, but the existing implementation (not talking about TE here) doesn't have a code path that will handle the LM codes not being present.
The current implementation is very, very basic (aside from actual issues like malformed prompt encoding) and is going to produce much worse results than the official implementation even for the features it does support. Hopefully this will be improved, but since ACE 1.5 never even supported languages other than English and Japanese, being optimistic may not be warranted.
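To make the LM codes vs. latent size point concrete, here is a rough sketch of the arithmetic such a node would need. `latent_frames_for_codes` is a hypothetical helper, and the rates in the usage example are placeholders I have not checked against ACE-Step or ComfyUI.

```python
import math


def latent_frames_for_codes(
    num_codes: int,
    codes_per_second: float,
    latent_frames_per_second: float,
) -> int:
    """Latent length (in frames) that covers the audio the LM codes describe."""
    # Duration actually implied by the generated codes, not the requested one.
    duration_seconds = num_codes / codes_per_second
    # Round up so the latent is never shorter than the codes.
    return math.ceil(duration_seconds * latent_frames_per_second)


# Placeholder rates for illustration only.
CODES_PER_SECOND = 5.0
LATENT_FRAMES_PER_SECOND = 25.0

print(latent_frames_for_codes(600, CODES_PER_SECOND, LATENT_FRAMES_PER_SECOND))
print(latent_frames_for_codes(525, CODES_PER_SECOND, LATENT_FRAMES_PER_SECOND))
```

With those placeholder rates, 600 codes would be a 2 minute song and 525 codes a 1:45 one, so sizing the latent from the requested duration instead of the code count can leave it several hundred frames off.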

Unfortunately, I don't really have time at the moment to do more than whine about those other issues.

zwukong · 2026-02-05T03:50:04Z

The main problem I have found so far is that too many words go missing. Sometimes a whole paragraph is gone. Some tags have no effect, such as male voice. Another issue is that it's rather slow, and the time cannot be set to auto yet. The sound quality is fine.

zwukong · 2026-02-05T06:22:38Z

The issues are still the same; in my tests, nothing changed. A whole paragraph can still go missing. A lot of words are missing. [Male Vocal] still has no effect.
Env: qwen 4B, turbo, 312 2.9cu130

gibru · 2026-02-10T06:21:03Z

In my tests, prompt adherence has mixed results between v0.12.2 and v0.12.3. I'm still playing with both versions to figure out the nuances, especially because the text encoder for Ace has gained some additional settings (previously not exposed?).

One example: in v0.12.2 I could start a song with a tag like [Saxophone Intro] and it would work. Using identical settings (incl. kSampler), v0.12.3 ignores it completely. At the same time, I get the feeling that v0.12.3 has some improvements in prompt adherence in other areas. However, when it comes to taste, for my kind of prompting v0.12.2 produces the better results overall. Sound quality seems to have improved, at least in v0.12.3+. For reference: ace-step-v1.5-sft + dual clip with 0.6b and 1.7b.

zwukong · 2026-02-10T07:11:58Z

Very interesting: without the LLM it is better, better than 1.5, even 4B. At least the male voice can show up 😄

Improvements to ACE-Steps 1.5 text encoding

32dd91c

blepping requested review from Kosinkadink, comfyanonymous and guill as code owners February 4, 2026 17:28

Merge branch 'master' into improve_ace15_te

a510a56

comfyanonymous merged commit a246cc0 into Comfy-Org:master Feb 5, 2026
12 checks passed

luna-niemitalo pushed a commit to luna-niemitalo/ComfyUI that referenced this pull request Feb 11, 2026

Improvements to ACE-Steps 1.5 text encoding (Comfy-Org#12283)

9226dfc

comfyanonymous merged 2 commits into Comfy-Org:master from blepping:improve_ace15_te

4 participants
