dynamic_vram: Fix windows Aimdo crash + Fix LLM performance#12408

Merged
comfyanonymous merged 3 commits into master from prs/dynamic-vram-fixes/windows-unbacked-virt-bug
Feb 11, 2026
Conversation


@rattus128 rattus128 commented Feb 11, 2026

#12401

On Windows, torch can assert during Tensor construction if the VRAM for the tensor has no physical backing. This means we cannot pre-create aimdo tensors at load time (see the revert).

To recover the CPU perf win previously attempted, instead create the tensor on a non-resident cache hit (usually the first step), so its validity tracks the cache signature.

Following that, do a perf pass to minimize the fast path through the comfy caster, speeding up CPU-bound LLM inference on high-end GPUs. Primary commit message:

dynamic_vram: Minimize fast path CPU work

Move as much as possible inside the not-resident branch and cache
the formed weight and bias rather than the flat intermediates. At
extreme layer-weight rates this adds up.
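
The caching scheme above can be sketched roughly as follows (a minimal illustration; `LazyAimdoTensor`, the `alloc` dict, and `make_tensor` are hypothetical stand-ins, not the actual ComfyUI/comfy_aimdo API):

```python
class LazyAimdoTensor:
    """Sketch: build the torch tensor for an aimdo allocation lazily, on the
    first non-resident cache hit, and reuse it while the signature matches."""

    def __init__(self):
        self._sig = None      # signature the cached tensor was built against
        self._tensor = None

    def get(self, alloc, make_tensor):
        sig = alloc["signature"]
        if self._sig == sig:
            # Fast path: signature match, skip the expensive torch-side
            # construction entirely.
            return self._tensor
        # Non-resident hit: physical backing now exists, so it is safe to
        # construct the tensor (doing this at load time crashed on Windows).
        self._tensor = make_tensor(alloc)
        self._sig = sig
        return self._tensor
```

On the first step the tensor is built once; subsequent steps with the same signature return the cached object without touching torch again.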

Example test conditions:

Ace step 1.5 Template workflow, 195s
Linux, RTX6000 Blackwell Pro, AMD Ryzen 5 9600X
--fast dynamic_vram
[screenshot: workflow]

Before (15.5s):

got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
LM sampling: 100%|████████████████████████████████████████████████████████████████████| 975/975 [00:09<00:00, 107.60it/s]
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
Requested to load ACEStep15
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 14.75it/s]
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 15.55 seconds

After (14.0s):

got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
Model ACE15TEModel_ prepared for dynamic VRAM loading. 4673MB Staged. 0 patches attached.
LM sampling: 100%|████████████████████████████████████████████████████████████████████| 975/975 [00:07<00:00, 125.12it/s]
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
Requested to load ACEStep15
Model ACEStep15 prepared for dynamic VRAM loading. 4565MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 14.96it/s]
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 14.05 seconds

Without dynamic_vram (14.6s):

got prompt
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16
Requested to load ACE15TEModel_
loaded completely;  4673.04 MB loaded, full load: True
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cuda:0, dtype: torch.float16
LM sampling: 100%|████████████████████████████████████████████████████████████████████| 975/975 [00:07<00:00, 124.41it/s]
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
Requested to load ACEStep15
loaded completely; 90586.76 MB usable, 4565.35 MB loaded, full load: True
100%|██████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 17.32it/s]
Requested to load AudioOobleckVAE
loaded completely;  321.70 MB loaded, full load: True
Prompt executed in 14.58 seconds

Example test conditions (crash-fix):

Windows, RTX3060, LTX2 I2V.

Before:

  File "C:\Users\rattus\Comfyui\comfy_api\latest\_io.py", line 1710, in EXECUTE_NORMALIZED
    to_return = cls.execute(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\ComfyUI\comfy_extras\nodes_lt.py", line 121, in execute
    t = vae.encode(encode_pixels)
        ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\Comfyui\comfy\sd.py", line 1006, in encode
    model_management.load_models_gpu([self.patcher], memory_required=memory_used, force_full_load=self.disable_offload)
  File "C:\Users\rattus\Comfyui\comfy\model_management.py", line 751, in load_models_gpu
    loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
  File "C:\Users\rattus\Comfyui\comfy\model_management.py", line 533, in model_load
    self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
  File "C:\Users\rattus\Comfyui\comfy\model_management.py", line 563, in model_use_more_vram
    return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\Comfyui\comfy\model_patcher.py", line 1618, in partially_load
    raise e
  File "C:\Users\rattus\Comfyui\comfy\model_patcher.py", line 1615, in partially_load
    self.load(device_to, dirty=dirty)
  File "C:\Users\rattus\Comfyui\comfy\model_patcher.py", line 1545, in load
    m._v_tensor = comfy_aimdo.torch.aimdo_to_tensor(m._v, device_to)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\venvw\Lib\site-packages\comfy_aimdo\torch.py", line 24, in aimdo_to_tensor
    return get_tensor_from_raw_ptr(ptr, size, device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\rattus\venvw\Lib\site-packages\comfy_aimdo\torch.py", line 20, in get_tensor_from_raw_ptr
    return torch.as_tensor(holder, device=device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The specified pointer resides on host memory and is not registered with any CUDA device.

Prompt executed in 1.95 seconds

After:

got prompt
Requested to load LTXAV
Model LTXAV prepared for dynamic VRAM loading. 20542MB Staged. 0 patches attached.
100%|██████████████████████████████████████████████████████████████████████████████████| 20/20 [02:53<00:00,  8.66s/it]
Requested to load VideoVAE
Model VideoVAE prepared for dynamic VRAM loading. 4663MB Staged. 0 patches attached.
Requested to load LTXAV
0 models unloaded.
Model LTXAV prepared for dynamic VRAM loading. 20542MB Staged. 1370 patches attached.
100%|████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:03<00:00, 21.05s/it]
Requested to load AudioVAE
loaded completely; 678.57 MB usable, 415.20 MB loaded, full load: True
0 models unloaded.
Model VideoVAE prepared for dynamic VRAM loading. 4663MB Staged. 0 patches attached.
Prompt executed in 322.24 seconds

These tensors constructed from aimdo allocations are CPU expensive to
make on the pytorch side. Add a cached version that stays valid while
the signature matches, to fast path past whatever torch is doing.
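
A rough sketch of the second commit's idea (`form_weight_bias` and the `state` dict are illustrative stand-ins, not the real comfy caster code): keep all per-layer work inside the not-resident branch and cache the final weight/bias pair rather than the flat intermediates:

```python
def form_weight_bias(layer):
    # Stand-in for the expensive cast/reshape work that turns the flat
    # staged intermediates into the final weight and bias tensors.
    return layer["flat_w"] * 2, layer["flat_b"] * 2

def get_weight_bias(layer, state):
    if not state.get("resident", False):
        # Slow path, taken only when the layer is not resident: form the
        # final weight and bias once and cache them.
        state["weight"], state["bias"] = form_weight_bias(layer)
        state["resident"] = True
    # Fast path: two dict lookups per call, no per-step recomputation
    # from the flat intermediates.
    return state["weight"], state["bias"]
```

With many small layers streamed per step (the "extreme layer weight rates" case), shaving this per-layer work off the fast path is what closes the gap to the non-dynamic_vram numbers above.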
@rattus128 rattus128 marked this pull request as draft February 11, 2026 14:44
@rattus128 rattus128 changed the title from "dynamic_vram: Fix windows Aimdo crash" to "dynamic_vram: Fix windows Aimdo crash + Fix LLM performance" Feb 11, 2026
@rattus128 rattus128 marked this pull request as ready for review February 11, 2026 16:12
@comfyanonymous comfyanonymous merged commit d297a74 into master Feb 11, 2026
14 checks passed