
Dynamic VRAM fixes - Ace 1.5 performance + a VRAM leak#12368

Merged
comfyanonymous merged 4 commits into Comfy-Org:master from rattus128:prs/dynamic-vram-fixes/ace-llm-perf on Feb 9, 2026

Conversation

@rattus128 (Contributor) commented Feb 9, 2026

Effectively fully load non-comfy weights by using a new Aimdo lower-watermark feature, which allows the non-comfy caster to skip the deep-copy and unpin extra steps. On top of that, there is a significant speedup from not calling aimdo_to_tensor() in the critical path, as this only needs to be done once.
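The critical-path win above can be sketched as a plain caching pattern: do the expensive conversion once and reuse the result on every subsequent step. This is an illustrative sketch only, with hypothetical names (`CachedWeightView`, `expensive_convert`), not the actual ComfyUI or aimdo code.

```python
# Illustrative sketch (hypothetical names): hoist a one-time conversion out
# of the per-step critical path by caching its result, in the spirit of
# calling an aimdo_to_tensor()-style function once instead of every step.

class CachedWeightView:
    """Wraps an expensive raw->tensor conversion and performs it only once."""
    def __init__(self, raw_weight, convert):
        self._raw = raw_weight
        self._convert = convert   # stand-in for the real conversion function
        self._cached = None
        self.conversions = 0      # counts actual conversions, for illustration

    def tensor(self):
        # Critical path: after the first call this is just an attribute read.
        if self._cached is None:
            self._cached = self._convert(self._raw)
            self.conversions += 1
        return self._cached

def expensive_convert(raw):
    # Stand-in for the costly part (deep copy, pinning, etc.)
    return list(raw)

view = CachedWeightView([1.0, 2.0, 3.0], expensive_convert)
for _ in range(8):            # e.g. 8 sampling steps
    w = view.tensor()
assert view.conversions == 1  # converted once, not once per step
```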

My performance gains are only moderate on my setup; however, this closes the gap to non-dynamic_vram for my hardware.

Bump to the new aimdo version to pick up the feature.

Fix a VRAM leak found in community testing this morning. Thanks to TK3R from Discord for the joint debug session.

Example test conditions - Ace step:

Ace Step 1.5 AIO. RTX 5090 Linux --fast dynamic_vram


Before:

got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
Requested to load ACEStep15
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████| 8/8 [00:00<00:00, 12.44it/s]                                   
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 10.60 seconds
got prompt
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████| 8/8 [00:00<00:00, 14.05it/s]                                   
Prompt executed in 9.29 seconds
got prompt
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████| 8/8 [00:00<00:00, 14.04it/s]                                   
Prompt executed in 9.35 seconds

After:

To see the GUI go to: http://0.0.0.0:8188
got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
Requested to load ACEStep15
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████| 8/8 [00:00<00:00, 12.47it/s]                                   
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 10.39 seconds
got prompt
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████| 8/8 [00:00<00:00, 14.07it/s]                                   
Prompt executed in 8.95 seconds
got prompt
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████| 8/8 [00:00<00:00, 14.05it/s]                                   
Prompt executed in 8.89 seconds

No dynamic VRAM:

got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
loaded completely; 30235.42 MB usable, 4673.04 MB loaded, full load: True
Requested to load ACEStep15
loaded completely; 25390.26 MB usable, 4565.35 MB loaded, full load: True

[rgthree-comfy] Loaded 48 magnificent nodes. 🎉

OpenCV not installed

Initializing ControlAltAI Nodes
100%|██████████| 8/8 [00:00<00:00, 14.27it/s]
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 10.76 seconds
got prompt
100%|██████████| 8/8 [00:00<00:00, 15.41it/s]
Prompt executed in 8.93 seconds
got prompt
100%|██████████| 8/8 [00:00<00:00, 15.39it/s]
Prompt executed in 8.98 seconds
got prompt
100%|██████████| 8/8 [00:00<00:00, 15.38it/s]
Prompt executed in 8.94 seconds

VRAM Leak Example test conditions:

WAN 2.2 14B GGUF Q8 low noise into FP16 high noise. RTX 5090 Linux --fast dynamic_vram

Before:

[VRAM usage graph]

^^ That dip is inference VRAM being released by the GGUF low-noise model, but the model itself is never unloaded. The high-noise model then dynamically offloads (the climbing RAM usage is the pinning).

Requested to load WAN21
loaded completely; 22504.73 MB usable, 14825.46 MB loaded, full load: True
100%|██████████| 2/2 [00:19<00:00,  9.84s/it]                                   
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 27252MB Staged. 400 patches attached.
100%|██████████| 2/2 [00:21<00:00, 10.59s/it]                                   
Model WanVAE prepared for dynamic VRAM loading. 484MB Staged. 0 patches attached.
Prompt executed in 85.25 seconds

After:

[VRAM usage graph]

^^ The dip is the GGUF model getting properly released from VRAM; the high-noise model then fully loads on the 5090 as expected.

Requested to load WAN21
100%|██████████| 2/2 [00:19<00:00,  9.78s/it]                                   
model weight dtype torch.float16, manual cast: None
model_type FLOW
Requested to load WAN21
Model WAN21 prepared for dynamic VRAM loading. 27252MB Staged. 400 patches attached.
100%|██████████| 2/2 [00:19<00:00,  9.89s/it]                                   
Model WanVAE prepared for dynamic VRAM loading. 484MB Staged. 0 patches attached.
Prompt executed in 75.30 seconds
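The class of leak fixed above can be sketched in plain Python: a loader keeps a stray reference to a finished model, so its memory is never returned even though the per-inference memory was released. This is an illustrative sketch only (the names `Loader`/`FakeModel` are hypothetical), not the actual ComfyUI model-management code.

```python
# Illustrative sketch (hypothetical names): the leak is a lingering reference
# to the finished low-noise model; the fix is to actually drop it so the
# allocator can reclaim the memory before the next model loads.
import gc

class FakeModel:
    freed = 0
    def __del__(self):
        FakeModel.freed += 1  # stands in for VRAM actually being returned

class Loader:
    def __init__(self):
        self._current = None

    def load(self, model):
        self._current = model

    def unload(self):
        # The fix, in spirit: drop the reference instead of holding it.
        self._current = None
        gc.collect()

loader = Loader()
loader.load(FakeModel())   # low-noise GGUF model
loader.unload()            # without this, the reference (and VRAM) lingers
assert FakeModel.freed == 1
loader.load(FakeModel())   # high-noise FP16 model now has room to fully load
```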

@rattus128 force-pushed the prs/dynamic-vram-fixes/ace-llm-perf branch from 4ead8ae to 7346af0 on February 9, 2026 07:50
@comfyanonymous comfyanonymous merged commit 62315fb into Comfy-Org:master Feb 9, 2026
12 checks passed
luna-niemitalo pushed a commit to luna-niemitalo/ComfyUI that referenced this pull request Feb 11, 2026
* revert threaded model loader change

This change was only needed to get around the pytorch 2.7 mempool bugs,
and should have been reverted along with Comfy-Org#12260. This fixes a different
memory leak where pytorch gets confused about cache emptying.

* load non comfy weights

* MPDynamic: Pre-generate the tensors for vbars

Apparently this is an expensive operation that slows down things.

* bump to aimdo 1.8

New features:
watermark limit feature
logging enhancements
-O2 build on linux