Skip to content

Improvements to ACE-Steps 1.5 text encoding#12283

Merged
comfyanonymous merged 2 commits intoComfy-Org:masterfrom
blepping:improve_ace15_te
Feb 5, 2026
Merged

Improvements to ACE-Steps 1.5 text encoding#12283
comfyanonymous merged 2 commits intoComfy-Org:masterfrom
blepping:improve_ace15_te

Conversation

@blepping
Copy link
Contributor

@blepping blepping commented Feb 4, 2026

ACE-Steps 1.5 text encoding is pretty broken right now, unfortunately. I am not positive these changes are perfect/a complete fix. I didn't have a whole lot of time to work on it, and as a result of that this is also very lightly tested.

Results with these changes seem to result in much better output. Here's something fun to listen to: https://voca.ro/11TKOC8Jgebf

Given a caption Blah blah user caption and lyrics [Instrumental] this is debug output from the the current implementation:

Raw debug output

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\nbpm: 160\nduration: 175\nkeyscale: D major\ntimesignature: 3\n</think>\n\n<|im_end|>\n'

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\n<|im_end|>\n'

TOKENIZING: '# Languages\nzh\n\n# Lyric[Instrumental]<|endoftext|><|endoftext|>`

TOKENIZING: '# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n# Caption\nBlah blah user caption# Metas\n- bpm: 160\n- timesignature: 3\n- keyscale: D major\n- duration: 175\n<|endoftext|>\n<|endoftext|>'

That's pretty hard to read, so printed out as strings:

lm_prompt

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>
bpm: 160
duration: 175
keyscale: D major
timesignature: 3
</think>

<|im_end|>

Serious issues here:

  1. There is no # Lyric section, the lyrics are just run together with the caption.
  2. The CoT part is supposed to be YAML encoded with sorted keys and include the caption as well. I think. The both existing implementations are a complex maze that is difficult to follow.

Refs:

lm_prompt_negative

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>

</think>

<|im_end|>

lyrics

# Languages
zh

# Lyric[Instrumental]<|endoftext|><|endoftext|>

Serious issues here:

  1. The lyric section is run together with the lyric section header. No wonder the model has issues with lyric conformance!

qwen3_06b

# Instruction
Generate audio semantic tokens based on the given conditions:

# Caption
Blah blah user caption# Metas
- bpm: 160
- timesignature: 3
- keyscale: D major
- duration: 175
<|endoftext|>
<|endoftext|>

Serious issues here:

  1. The caption is run together with the metas section (unless the user has a newline at the end of the lyrics and the code doesn't strip the whitespace, I didn't check but most users wouldn't expect that to be necessary).
  2. Duration is supposed to be like 175 seconds, not just a bare number.

Refs:

  1. https://github.com/ace-step/ACE-Step-1.5/blob/4ab3630fd6c8868dd98bf1a9dde0fe4cc54674f3/acestep/handler.py#L797
  2. https://github.com/ace-step/ACE-Step-1.5/blob/4ab3630fd6c8868dd98bf1a9dde0fe4cc54674f3/acestep/handler.py#L1179 - this maybe implies it should use the "123 seconds" format everywhere. I am not positive so I didn't mess with other places.

With this pull, we get the output:

Raw debug output

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n# Lyric\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\nbpm: 160\ncaption: Blah blah user caption\nduration: 175\nkeyscale: D major\nlanguage: zh\ntimesignature: 3\n</think>\n<|im_end|>\n'

TOKENIZING: '<|im_start|>system\n# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n<|im_end|>\n<|im_start|>user\n# Caption\nBlah blah user caption\n# Lyric\n[Instrumental]\n<|im_end|>\n<|im_start|>assistant\n<think>\n</think>\n<|im_end|>\n'

TOKENIZING: '# Languages\nzh\n\n# Lyric\n[Instrumental]<|endoftext|><|endoftext|>'

TOKENIZING: '# Instruction\nGenerate audio semantic tokens based on the given conditions:\n\n# Caption\nBlah blah user caption\n# Metas\n- bpm: 160\n- duration: 175 seconds\n- keyscale: D major\n- timesignature: 3\n<|endoftext|>\n<|endoftext|>'

lm_prompt

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
# Lyric
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>
bpm: 160
caption: Blah blah user caption
duration: 175
keyscale: D major
language: zh
timesignature: 3
</think>
<|im_end|>

lm_prompt_negative

<|im_start|>system
# Instruction
Generate audio semantic tokens based on the given conditions:

<|im_end|>
<|im_start|>user
# Caption
Blah blah user caption
# Lyric
[Instrumental]
<|im_end|>
<|im_start|>assistant
<think>
</think>
<|im_end|>

lyrics

# Languages
zh

# Lyric
[Instrumental]<|endoftext|><|endoftext|>

qwen3_06b

# Instruction
Generate audio semantic tokens based on the given conditions:

# Caption
Blah blah user caption
# Metas
- bpm: 160
- duration: 175 seconds
- keyscale: D major
- timesignature: 3
<|endoftext|>
<|endoftext|>

No newline between the # Caption and # Metas section looks a bit weird, however the official template would have the same result if there wasn't a trailing newline: https://github.com/ace-step/ACE-Step-1.5/blob/eafcc2098696c60fb9e35d91813d84282d78959a/acestep/constants.py#L101


Potential issues this pull doesn't address:

  1. I am suspicious of the the double <|endoftext|> tokens at the ends of some of the prompts. Also one has a newline in between the tokens, one doesn't.
  2. Duration possibly should have a unit seconds in the places it appears. I only fixed the one place I was pretty sure about.
  3. Maybe ensuring a double space between the # Caption and # Metas sections in the SFT prompt would be a good idea.
  4. I am pretty sure the number of LM codes can be mismatched with the latent size since the audio codes output isn't going to be exact (as far as I know) but something roughly at the specified duration. You're likely to get very suboptimal results with LM codes for a, let's say, 2 minute song when you only provide a 1:45 minute latent. It would probably make sense to have a node to make an empty latent based on the conditioning after it was generated to ensure the sizes are copacetic.
  5. Forcing a specific time signature, key, language, etc always, no matter what is not how the official model works and severely limits creativity/diversity in results. There is also no way to change stuff like the LM negative prompt, CFG scale, sampling parameters, etc.
  6. The model also doesn't even require the LM codes part, but the existing implementation (not talking about TE here) doesn't even have a code path that will handle the LM codes not being present.
  7. The current implementation is very, very basic (aside from actual issues like malformed prompt encoding) and is going to produce much worse results than the official implementation even for the features it does support. Hopefully this will be improved, but since ACE 1.5 never even supported languages other than English and Japanese so being optimistic may not be warranted.

Unfortunately, I don't really have time at the moment to do more than whine about those other issues.

@zwukong
Copy link

zwukong commented Feb 5, 2026

The main problem that i discovered now is that there are too many missing words. Sometimes, a single paragraph is gone. some tags have no effects, such as male voice. Another issue is that it's rather slow, and the time cannot be set to auto yet. The sound quality is fine

@comfyanonymous comfyanonymous merged commit a246cc0 into Comfy-Org:master Feb 5, 2026
12 checks passed
@zwukong
Copy link

zwukong commented Feb 5, 2026

Issues still the same , in my tests,nothing changed. A whole paragraph's gone still can happen. A lot of missing words. [Male Vocal] still has no effect.
Env: qwen 4B, turbo, 312 2.9cu130

@gibru
Copy link

gibru commented Feb 10, 2026

In my tests, prompt adherence has mixed results between v0.12.2 and v0.12.3. Still playing with both versions to figure out the nuances, especially because the text encoder for Ace has some gained some additional settings (previously not exposed?).

One example: In v0.12.2 I could start a song with a tag like [Saxophone Intro] and it would work. Using identical settings (incl. kSampler), v0.12.3 ignores it completely. At the same time I get the feeling that in other areas v0.12.3 has some improvements in prompt adherence. However, when it comes to taste, for my kind of prompting v0.12.2 produces the better results overall. Sound quality seems to have improved at least in v0.12.3+ For reference: ace-step-v1.5-sft + dual clip with 0.6b and 1.7b.

@zwukong
Copy link

zwukong commented Feb 10, 2026

very intertesting , without llm is better, better than 1.5 , even 4B,at least male voice can show 😄

luna-niemitalo pushed a commit to luna-niemitalo/ComfyUI that referenced this pull request Feb 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants