
Initial ACE-Step model implementation. #7972

Merged
comfyanonymous merged 5 commits into master from temp_pr
May 7, 2025

Conversation

@comfyanonymous
Member

@comfyanonymous comfyanonymous commented May 7, 2025

Download the checkpoint from https://huggingface.co/Comfy-Org/ACE-Step_ComfyUI_repackaged/tree/main/all_in_one and put it in ComfyUI/models/checkpoints.

To load the workflow, copy the JSON below and paste it into ComfyUI:

{
  "id": "88ac5dad-efd7-40bb-84fe-fbaefdee1fa9",
  "revision": 0,
  "last_node_id": 45,
  "last_link_id": 112,
  "nodes": [
    {
      "id": 44,
      "type": "ConditioningZeroOut",
      "pos": [
        785,
        459
      ],
      "size": [
        197.712890625,
        26
      ],
      "flags": {},
      "order": 4,
      "mode": 0,
      "inputs": [
        {
          "name": "conditioning",
          "type": "CONDITIONING",
          "link": 108
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "links": [
            109
          ]
        }
      ],
      "properties": {
        "Node name for S&R": "ConditioningZeroOut"
      },
      "widgets_values": []
    },
    {
      "id": 40,
      "type": "CheckpointLoaderSimple",
      "pos": [
        179.5068359375,
        87.76739501953125
      ],
      "size": [
        375,
        98
      ],
      "flags": {},
      "order": 0,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            111
          ]
        },
        {
          "name": "CLIP",
          "type": "CLIP",
          "links": [
            80
          ]
        },
        {
          "name": "VAE",
          "type": "VAE",
          "links": [
            83
          ]
        }
      ],
      "properties": {
        "Node name for S&R": "CheckpointLoaderSimple"
      },
      "widgets_values": [
        "ace_step_v1_3.5b.safetensors"
      ]
    },
    {
      "id": 18,
      "type": "VAEDecodeAudio",
      "pos": [
        1370,
        100
      ],
      "size": [
        150.93612670898438,
        46
      ],
      "flags": {},
      "order": 6,
      "mode": 0,
      "inputs": [
        {
          "name": "samples",
          "type": "LATENT",
          "link": 101
        },
        {
          "name": "vae",
          "type": "VAE",
          "link": 83
        }
      ],
      "outputs": [
        {
          "name": "AUDIO",
          "type": "AUDIO",
          "links": [
            26
          ]
        }
      ],
      "properties": {
        "Node name for S&R": "VAEDecodeAudio"
      },
      "widgets_values": []
    },
    {
      "id": 17,
      "type": "EmptyAceStepLatentAudio",
      "pos": [
        710,
        540
      ],
      "size": [
        270,
        82
      ],
      "flags": {},
      "order": 1,
      "mode": 0,
      "inputs": [],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "links": [
            23
          ]
        }
      ],
      "properties": {
        "Node name for S&R": "EmptyAceStepLatentAudio"
      },
      "widgets_values": [
        120,
        1
      ]
    },
    {
      "id": 19,
      "type": "SaveAudio",
      "pos": [
        1539,
        100
      ],
      "size": [
        295.0655212402344,
        112
      ],
      "flags": {},
      "order": 7,
      "mode": 0,
      "inputs": [
        {
          "name": "audio",
          "type": "AUDIO",
          "link": 26
        }
      ],
      "outputs": [],
      "properties": {},
      "widgets_values": [
        "audio/ComfyUI"
      ]
    },
    {
      "id": 3,
      "type": "KSampler",
      "pos": [
        1040,
        90
      ],
      "size": [
        315,
        262
      ],
      "flags": {},
      "order": 5,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 112
        },
        {
          "name": "positive",
          "type": "CONDITIONING",
          "link": 110
        },
        {
          "name": "negative",
          "type": "CONDITIONING",
          "link": 109
        },
        {
          "name": "latent_image",
          "type": "LATENT",
          "link": 23
        }
      ],
      "outputs": [
        {
          "name": "LATENT",
          "type": "LATENT",
          "slot_index": 0,
          "links": [
            101
          ]
        }
      ],
      "properties": {
        "Node name for S&R": "KSampler"
      },
      "widgets_values": [
        315277030015967,
        "randomize",
        50,
        4,
        "res_multistep",
        "simple",
        1
      ]
    },
    {
      "id": 45,
      "type": "ModelSamplingSD3",
      "pos": [
        716.4029541015625,
        -30.665313720703125
      ],
      "size": [
        270,
        58
      ],
      "flags": {},
      "order": 2,
      "mode": 0,
      "inputs": [
        {
          "name": "model",
          "type": "MODEL",
          "link": 111
        }
      ],
      "outputs": [
        {
          "name": "MODEL",
          "type": "MODEL",
          "links": [
            112
          ]
        }
      ],
      "properties": {
        "Node name for S&R": "ModelSamplingSD3"
      },
      "widgets_values": [
        4.000000000000001
      ]
    },
    {
      "id": 14,
      "type": "TextEncodeAceStepAudio",
      "pos": [
        580,
        83
      ],
      "size": [
        410.834716796875,
        305.39215087890625
      ],
      "flags": {},
      "order": 3,
      "mode": 0,
      "inputs": [
        {
          "name": "clip",
          "type": "CLIP",
          "link": 80
        }
      ],
      "outputs": [
        {
          "name": "CONDITIONING",
          "type": "CONDITIONING",
          "links": [
            108,
            110
          ]
        }
      ],
      "properties": {
        "Node name for S&R": "TextEncodeAceStepAudio"
      },
      "widgets_values": [
        "female, electronic, vocals, singing, upbeat, fast, fennec core ",
        "[verse]\ncute fennec girl\nmassive fennec ears\nbig fluffy tail\nlong blonde wavy hair\nlarge blue eyes\nI love fennec girl\n"
      ]
    }
  ],
  "links": [
    [
      23,
      17,
      0,
      3,
      3,
      "LATENT"
    ],
    [
      26,
      18,
      0,
      19,
      0,
      "AUDIO"
    ],
    [
      80,
      40,
      1,
      14,
      0,
      "CLIP"
    ],
    [
      83,
      40,
      2,
      18,
      1,
      "VAE"
    ],
    [
      101,
      3,
      0,
      18,
      0,
      "LATENT"
    ],
    [
      108,
      14,
      0,
      44,
      0,
      "CONDITIONING"
    ],
    [
      109,
      44,
      0,
      3,
      2,
      "CONDITIONING"
    ],
    [
      110,
      14,
      0,
      3,
      1,
      "CONDITIONING"
    ],
    [
      111,
      40,
      0,
      45,
      0,
      "MODEL"
    ],
    [
      112,
      45,
      0,
      3,
      0,
      "MODEL"
    ]
  ],
  "groups": [],
  "config": {},
  "extra": {
    "frontendVersion": "1.18.9"
  },
  "version": 0.4
}
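To see how the nodes in the workflow above are wired together, the `links` array can be read programmatically. The following is a minimal sketch, not part of the PR. It assumes each link entry has the layout `[link_id, source_node_id, source_output_slot, target_node_id, target_input_slot, type]`, which is inferred from the JSON above (for example, link 111 goes from node 40 to node 45). The function name `describe_links` is hypothetical, and the embedded JSON is a trimmed copy that keeps only a few nodes and links.

```python
import json

# Trimmed copy of the workflow above: only node ids/types and three of the links.
WORKFLOW_JSON = """
{
  "nodes": [
    {"id": 40, "type": "CheckpointLoaderSimple"},
    {"id": 45, "type": "ModelSamplingSD3"},
    {"id": 3,  "type": "KSampler"},
    {"id": 14, "type": "TextEncodeAceStepAudio"}
  ],
  "links": [
    [111, 40, 0, 45, 0, "MODEL"],
    [112, 45, 0, 3, 0, "MODEL"],
    [80, 40, 1, 14, 0, "CLIP"]
  ]
}
"""


def describe_links(workflow: dict) -> list[str]:
    """Render each link as 'SourceType[slot] -> TargetType[slot] (TYPE)'."""
    # Assumed link layout (inferred from the workflow JSON, not from documentation):
    # [link_id, source_node_id, source_output_slot,
    #  target_node_id, target_input_slot, type]
    node_types = {node["id"]: node["type"] for node in workflow["nodes"]}
    lines = []
    for _link_id, src, src_slot, dst, dst_slot, kind in workflow["links"]:
        lines.append(
            f"{node_types[src]}[{src_slot}] -> {node_types[dst]}[{dst_slot}] ({kind})"
        )
    return lines


edges = describe_links(json.loads(WORKFLOW_JSON))
for line in edges:
    print(line)
```

Run against the full workflow, the same function should list all ten links, which makes it easier to check that the checkpoint's MODEL output passes through ModelSamplingSD3 before reaching the KSampler.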

@comfyanonymous comfyanonymous added the Run-CI-Test label May 7, 2025
@github-actions
Contributor

github-actions bot commented May 7, 2025

(Automated Bot Message) CI Tests are running, you can view the results at https://ci.comfy.org/?branch=7972%2Fmerge

@comfyanonymous comfyanonymous merged commit 16417b4 into master May 7, 2025
19 checks passed
@comfyanonymous comfyanonymous deleted the temp_pr branch May 7, 2025 12:34
@mcmonkey4eva
Contributor

This imports torchaudio unconditionally, i.e. in an environment where torchaudio isn't available, ComfyUI fails to boot.

@comfyanonymous
Member Author

torchaudio has been in requirements.txt for 11 months now. Which environment doesn't have it?

@mcmonkey4eva
Contributor

DirectML mainly (I know, that's the worst way to run anything, but people do it sometimes). To my understanding, basically any "modified torch version" is usually missing torchaudio, i.e. other non-NVIDIA GPU setups tend to be missing it too. I think even the early Blackwell torch builds had broken audio? Not sure on that bit, it's a secondhand memory.

@comfyanonymous
Member Author

Should be fixed now.
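
The actual fix is not shown in this thread. As a rough illustration of the general pattern for an optional dependency, a guarded import could look like the sketch below. The helper names `optional_import` and `require_torchaudio` are hypothetical and are not ComfyUI functions. Catching `OSError` in addition to `ImportError` is a precaution for builds where the package is installed but its native library fails to load.

```python
import importlib


def optional_import(name: str):
    """Return the imported module, or None if it is missing or fails to load."""
    try:
        return importlib.import_module(name)
    except (ImportError, OSError):
        # ImportError: the package is not installed.
        # OSError: the package exists but a native library could not be loaded.
        return None


# Importing at startup no longer raises if torchaudio is unavailable.
torchaudio = optional_import("torchaudio")


def require_torchaudio():
    """Call from audio code paths only, so non-audio workflows still run."""
    if torchaudio is None:
        raise RuntimeError(
            "torchaudio is required for audio nodes but could not be imported"
        )
    return torchaudio
```

With this shape, startup succeeds without torchaudio, and the error surfaces only when an audio node actually needs the library.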

@ChuxiJ

ChuxiJ commented May 8, 2025

I found the issue: ace-step/ACE-Step#54
I am figuring out what the difference is.

@planb788

planb788 commented May 8, 2025

With the default workflow settings, I can't generate songs with Chinese lyrics, but I can in the Gradio interface of ACE-Step.

@allo-

allo- commented May 8, 2025

I tried the workflow and it works fine, but I have two problems. First, it seems to use more RAM in the VAE (or it doesn't release caches beforehand?) than the official implementation, and then falls back to tiled VAE for generations that the Gradio UI can do without tiling. Second, the quality of longer songs is worse than with the official implementation. Can this be an effect of using the tiled VAE for longer songs, or should the quality be the same?

@agustincaniglia

my nodes aren't loading...

@mcmonkey4eva
Contributor

mcmonkey4eva commented May 9, 2025

@planb788 I had the same issue earlier: C/J/K characters don't seem to pass through correctly. EDIT: judging by 5d3cc85, it looks like language-specific custom handling is needed? That commit added Japanese in particular by just converting it on the fly to Latin characters.
@allo- see the link above, ace-step/ACE-Step#54. It has some discussion about the differences in parameters between Comfy and the Gradio app; Comfy's workflow went for different defaults than the Gradio app uses.
@agustincaniglia sounds like a support issue, not specific to this PR. Open an issue on the Issues tab or join the Discord at https://discord.gg/comfyorg and ask for help there.
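
To illustrate what "converting on the fly to Latin characters" can mean, here is a deliberately tiny sketch. It is not the code from commit 5d3cc85. The table `KANA_TO_ROMAJI` and the function `romanize` are hypothetical, and a real romanizer needs a complete kana table plus rules for long vowels, small kana, particles, and kanji readings.

```python
# Hypothetical, intentionally incomplete table covering only a few hiragana.
KANA_TO_ROMAJI = {
    "こ": "ko",
    "ん": "n",
    "に": "ni",
    "ち": "chi",
    "は": "ha",
    "ね": "ne",
    "き": "ki",
    "つ": "tsu",
}


def romanize(text: str) -> str:
    """Replace each known kana with a Latin spelling; pass other characters through."""
    return "".join(KANA_TO_ROMAJI.get(ch, ch) for ch in text)


# Purely table-based, so the particle は comes out as "ha", not "wa".
print(romanize("こんにちは"))  # konnichiha
```

Characters outside the table, including Latin text and any kana not listed, are returned unchanged, so a sketch like this would not help with Chinese or Korean lyrics.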

@allo-

allo- commented May 9, 2025

@mcmonkey4eva I'm following both discussions and would like to combine the best of both approaches to get longer generations with the best quality. At the moment I'm probably waiting for the gradio app to get the multires scheduler as it exposes more control, but we'll probably see workflows soon that also control the parameters exposed in the official UI.

My main question at the moment is whether the tiled VAE affects the quality. It's hard to tell which artifacts come from the model and which may come from such workarounds for low VRAM.

On my system, the Gradio app works with 16 GB VRAM (almost full when the VAE is loaded), while Comfy needs to tile the VAE and also seems to need more VRAM for longer generations. In the Gradio app, by contrast, the VRAM requirement during generation seems to be almost the same for longer generations as for shorter ones.


Labels

Run-CI-Test This is an administrative label to tell the CI to run full automatic testing on this PR now.


6 participants