dynamic_vram: Fix windows Aimdo crash + Fix LLM performance#12408

Merged
comfyanonymous merged 3 commits into master from prs/dynamic-vram-fixes/windows-unbacked-virt-bug
Feb 11, 2026
Conversation


@rattus128 rattus128 commented Feb 11, 2026

#12401

On Windows, torch can assert during Tensor construction if the VRAM for the tensor has no physical backing. This means we cannot pre-create aimdo tensors at load time (see the revert).

To recover the CPU perf win previously attempted, instead create the tensor on a non-resident cache hit (usually the first step), so its validity tracks the cache signature.

Following that, do a perf pass to minimize the fast path through the comfy caster, speeding up CPU-bound LLM inference on high-end GPUs. Primary commit message:

dynamic_vram: Minimize fast path CPU work

Move as much as possible inside the not-resident branch and cache
the formed weight and bias rather than the flat intermediates. At
extreme layer-weight rates this adds up.
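
The caching scheme above can be sketched roughly as follows (a minimal illustration; `LazyAimdoTensor`, the `alloc` dict, and `make_tensor` are hypothetical stand-ins, not the actual ComfyUI/comfy_aimdo API):

```python
class LazyAimdoTensor:
    """Sketch: build the torch tensor for an aimdo allocation lazily, on the
    first non-resident cache hit, and reuse it while the signature matches."""

    def __init__(self):
        self._sig = None      # signature the cached tensor was built against
        self._tensor = None

    def get(self, alloc, make_tensor):
        sig = alloc["signature"]
        if self._sig == sig:
            # Fast path: signature match, skip the expensive torch-side
            # construction entirely.
            return self._tensor
        # Non-resident hit: physical backing now exists, so it is safe to
        # construct the tensor (doing this at load time crashed on Windows).
        self._tensor = make_tensor(alloc)
        self._sig = sig
        return self._tensor
```

On the first step the tensor is built once; subsequent steps with the same signature return the cached object without touching torch again.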

Example test conditions:

Ace step 1.5 Template workflow, 195s
Linux, RTX6000 Blackwell Pro, AMD Ryzen 5 9600X
--fast dynamic_vram
[screenshot: workflow]

Before (15.5s):

got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
LM sampling: 100%|████████████████████████████████████████████████████████████████████| 975/975 [00:09<00:00, 107.60it/s]
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
Requested to load ACEStep15
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 14.75it/s]
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 15.55 seconds

After (14.0s):

got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
LM sampling: 100%|████████████████████████████████████████████████████████████████████| 975/975 [00:07<00:00, 125.12it/s]
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
Requested to load ACEStep15
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 14.96it/s]
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 14.05 seconds

Without dynamic_vram (14.6s):

got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
loaded completely;  4673.04 MB loaded, full load: True
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
LM sampling: 100%|████████████████████████████████████████████████████████████████████| 975/975 [00:07<00:00, 124.41it/s]
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
Requested to load ACEStep15
loaded completely; 90586.76 MB usable, 4565.35 MB loaded, full load: True
100%|██████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 17.32it/s]
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 14.58 seconds

Example test conditions (crash-fix):

Windows, RTX3060, LTX2 I2V.

Before:

  File "C:\Users\rattus\Comfyui\comfy_api\latest\_io.py", line 1710, in EXECUTE_NORMALIZED
    to_return = cls.execute(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\ComfyUI\comfy_extras\nodes_lt.py", line 121, in execute
    t = vae.encode(encode_pixels)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\Comfyui\comfy\sd.py", line 1006, in encode
    model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
  File "C:\Users\rattus\Comfyui\comfy\model_management.py", line 751, in load_models_gpu
    loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
  File "C:\Users\rattus\Comfyui\comfy\model_management.py", line 533, in model_load
    self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
  File "C:\Users\rattus\Comfyui\comfy\model_management.py", line 563, in model_use_more_vram
    return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\Comfyui\comfy\model_patcher.py", line 1618, in partially_load
    raise e
  File "C:\Users\rattus\Comfyui\comfy\model_patcher.py", line 1615, in partially_load
    self.load(device_to, dirty=dirty)
  File "C:\Users\rattus\Comfyui\comfy\model_patcher.py", line 1545, in load
    m._v_tensor = comfy_aimdo.torch.aimdo_to_tensor(m._v, device_to)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\venvw\Lib\site-packages\comfy_aimdo\torch.py", line 24, in aimdo_to_tensor
    return get_tensor_from_raw_ptr(ptr, size, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\venvw\Lib\site-packages\comfy_aimdo\torch.py", line 20, in get_tensor_from_raw_ptr
    return torch.as_tensor(holder, device=device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

Prompt executed in 1.95 seconds

After:

got prompt
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 20542MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [02:53<00:00,  8.66s/it]
Requested to load VideoVAE
Model VideoVAE prepared for dynamic VRAM loading. 4663MB Staged. 0 patches attached.
Requested to load LTXAV
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 20542MB Staged. 1370 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:03<00:00, 21.05s/it]
Requested to load AudioVAE
loaded completely; 678.57 MB usable, 415.20 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 4663MB Staged. 0 patches attached.
Prompt executed in 322.24 seconds

These tensors constructed from aimdo allocations are CPU expensive to
make on the pytorch side. Add a cached version that stays valid while
the signature matches, to fast path past whatever torch is doing.
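
A rough sketch of the second commit's idea (`form_weight_bias` and the `state` dict are illustrative stand-ins, not the real comfy caster code): keep all per-layer work inside the not-resident branch and cache the final weight/bias pair rather than the flat intermediates:

```python
def form_weight_bias(layer):
    # Stand-in for the expensive cast/reshape work that turns the flat
    # staged intermediates into the final weight and bias tensors.
    return layer["flat_w"] * 2, layer["flat_b"] * 2

def get_weight_bias(layer, state):
    if not state.get("resident", False):
        # Slow path, taken only when the layer is not resident: form the
        # final weight and bias once and cache them.
        state["weight"], state["bias"] = form_weight_bias(layer)
        state["resident"] = True
    # Fast path: two dict lookups per call, no per-step recomputation
    # from the flat intermediates.
    return state["weight"], state["bias"]
```

With many small layers streamed per step (the "extreme layer weight rates" case), shaving this per-layer work off the fast path is what closes the gap to the non-dynamic_vram numbers above.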
@rattus128 rattus128 marked this pull request as draft February 11, 2026 14:44
@rattus128 rattus128 changed the title from "dynamic_vram: Fix windows Aimdo crash" to "dynamic_vram: Fix windows Aimdo crash + Fix LLM performance" Feb 11, 2026
@rattus128 rattus128 marked this pull request as ready for review February 11, 2026 16:12
@comfyanonymous comfyanonymous merged commit d297a74 into master Feb 11, 2026
14 checks passed