
Conversation

@rattus128
Contributor

@rattus128 rattus128 commented Nov 8, 2025

This fixes a particular VRAM OOM in a range of workflows, in particular flows that re-use a model for upscaling.

See below for root cause and fix.

Example test case:
upscale_oom.json
WAN 128x128x181f > x8 upscale (1024x1024x181f) > Same WAN model
RTX5090

To see the GUI go to: http://0.0.0.0:8188
To see the GUI go to: http://[::]:8188
got prompt
Using scaled fp8: fp8 matrix mult: True, scale input: True
model weight dtype torch.float16, manual cast: None
model_type FLOW
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load WanTEModel
loaded completely; 30235.05 MB usable, 6419.48 MB loaded, full load: True
Requested to load WAN21
loaded completely; 23645.45 MB usable, 13629.08 MB loaded, full load: True
100%|██████████| 1/1 [00:00<00:00,  9.40it/s]
0 models unloaded.
  0%|          | 0/1 [00:00<?, ?it/s]
!!! Exception during processing !!! Allocation on device 
Traceback (most recent call last):
  File "/home/rattus/ComfyUI/execution.py", line 510, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/execution.py", line 324, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
                    ...
  File "/home/rattus/ComfyUI/comfy/ldm/wan/model.py", line 78, in forward
    q = qkv_fn_q(x)
        ^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ldm/wan/model.py", line 69, in qkv_fn_q
    return apply_rope1(q, freqs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/rattus/ComfyUI/comfy/ldm/flux/math.py", line 33, in apply_rope1
    x_out = freqs_cis[..., 0] * x_[..., 0]
            ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~
torch.OutOfMemoryError: Allocation on device 

Got an OOM, unloading all loaded models.
Prompt executed in 10.47 seconds

With this fix:

To see the GUI go to: http://0.0.0.0:8188
To see the GUI go to: http://[::]:8188
[DEPRECATION WARNING] Detected import of deprecated legacy API: /scripts/ui.js. This is likely caused by a custom node extension using outdated APIs. Please update your extensions or contact the extension author for an updated version.
[DEPRECATION WARNING] Detected import of deprecated legacy API: /extensions/core/groupNode.js. This is likely caused by a custom node extension using outdated APIs. Please update your extensions or contact the extension author for an updated version.
[DEPRECATION WARNING] Detected import of deprecated legacy API: /scripts/ui/components/button.js. This is likely caused by a custom node extension using outdated APIs. Please update your extensions or contact the extension author for an updated version.
[DEPRECATION WARNING] Detected import of deprecated legacy API: /scripts/ui/components/buttonGroup.js. This is likely caused by a custom node extension using outdated APIs. Please update your extensions or contact the extension author for an updated version.
got prompt
Using scaled fp8: fp8 matrix mult: True, scale input: True
model weight dtype torch.float16, manual cast: None
model_type FLOW
Using scaled fp8: fp8 matrix mult: False, scale input: False
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load WanTEModel
loaded completely; 30109.99 MB usable, 6419.48 MB loaded, full load: True
Requested to load WAN21
loaded completely; 23520.39 MB usable, 13629.08 MB loaded, full load: True
100%|██████████| 1/1 [00:00<00:00,  1.56it/s]
Unloading WanTEModel
1 idle models unloaded.
Unloading WAN21
1 active models unloaded for increased offloading.
loaded partially; 128.00 MB usable, 124.27 MB loaded, 13504.81 MB offloaded, lowvram patches: 0
100%|██████████| 1/1 [05:00<00:00, 300.66s/it]
Prompt executed in 314.41 seconds


git commit message

In some workflows, it's possible for a model to be used twice but with different requirements for the inference VRAM.

Currently, once a model is loaded at a certain level of offload, it is preserved at that level of offload if it is used again. This will OOM if there is a major change in the size of the inference VRAM, which happens in the classic latent upscaling workflow where the same model is used twice, to generate and then to upscale.

This is very noticeable for WAN in particular.

Fix this by two-passing the model VRAM unload process: first try with the existing list of idle models, then try again with the models that are about to be loaded added to the list. This implements the partial offload of the hot-in-VRAM model needed to make space for the larger inference.

Also improve the info messages about any unloads performed.
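
For illustration, here is a minimal sketch of the two-pass idea. `LoadedModel`, its fields, and `free_memory` are hypothetical stand-ins for this explanation, not ComfyUI's actual model_management API:

```python
# Minimal sketch of the two-pass unload described above. LoadedModel and
# free_memory are hypothetical stand-ins, not ComfyUI's model_management API.
from dataclasses import dataclass

@dataclass
class LoadedModel:
    name: str
    loaded_mb: float   # VRAM currently held by this model
    in_use: bool       # True if the model is about to be (re)used

def free_memory(required_mb: float, resident: list[LoadedModel],
                to_load: list[LoadedModel]) -> float:
    """Free at least required_mb of VRAM, preferring idle models."""
    freed = 0.0
    # Pass 1: fully unload idle models (anything not needed by this run).
    for m in resident:
        if freed >= required_mb:
            break
        if not m.in_use and m not in to_load:
            freed += m.loaded_mb
            m.loaded_mb = 0.0
    # Pass 2: still short? Partially offload the models that are about to be
    # used, but only by the remaining deficit, so as much as possible stays
    # resident in VRAM.
    for m in to_load:
        if freed >= required_mb:
            break
        offload = min(m.loaded_mb, required_mb - freed)
        m.loaded_mb -= offload
        freed += offload
    return freed
```

In the "with this fix" log above, this corresponds to pass 1 unloading the idle WanTEModel and pass 2 partially offloading the active WAN21 so the larger upscale inference fits.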

@rattus128 rattus128 marked this pull request as draft November 8, 2025 08:25
@rattus128 rattus128 force-pushed the prs/model-reuse-oom branch from 6468c4c to ca73329 Compare November 8, 2025 08:30
@rattus128 rattus128 marked this pull request as ready for review November 8, 2025 08:42
@rattus128
Contributor Author

This also reproduced on a flow I was sent here:

city96/ComfyUI-GGUF#357 (comment)

This was a case of model reuse with a LoRA interposed between the two uses.

@comfyanonymous
Owner

This is slightly incorrect behavior. If you run a workflow with a text encoder that does not fit completely in memory, I see it unload completely between the positive prompt and the negative prompt. What should happen is a small partial unload instead.
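
For illustration only (hypothetical helper and numbers, not ComfyUI code): a "small partial unload" means offloading just the shortfall between what the next pass needs and what is currently free, rather than evicting the whole text encoder.

```python
# Hypothetical helper illustrating "a small partial unload": offload only the
# shortfall for the next pass, not the entire model. Numbers are made up.
def partial_offload_mb(required_mb: float, free_mb: float) -> float:
    """How many MB must be offloaded to satisfy the next request."""
    return max(0.0, required_mb - free_mb)

# e.g. the negative-prompt pass needs 1200 MB but only 1000 MB is free:
print(partial_offload_mb(1200.0, 1000.0))  # -> 200.0, not the full encoder size
```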

@rattus128
Contributor Author

This is slightly incorrect behavior. If you run a workflow with a text encoder that does not fit completely in memory, I see it unload completely between the positive prompt and the negative prompt. What should happen is a small partial unload instead.

I'll take a look at this case. Thanks.

@comfyanonymous
Owner

I didn't test it that much but this might be a better way: #10690

@rattus128
Contributor Author

I didn't test it that much but this might be a better way: #10690

I tested this and it looks good so far. Closing this one.

@rattus128 rattus128 closed this Nov 9, 2025
@rattus128
Contributor Author

^^

