
Massive hangups at each step of the workflow on the first run, after autotune was enabled by default with the --fast flag #9779

@whythisusername

Description


Custom Node Testing

Expected Behavior

~15 seconds of overhead on the first run

Actual Behavior

123 seconds of overhead on a simple SDXL workflow: the first run executes in 135.39 seconds, the second in only 12.50 seconds. And it goes beyond the core functions: the more nodes you add, the more overhead accumulates. It is especially bad with detailer custom nodes; with just one FaceDetailer, first-run execution time rises to 404.44 seconds.

Steps to Reproduce

Tested on a 2-pass workflow with upscale, which ended up with 123 s of overhead. The only thing that helps is explicitly disabling autotune by passing the other optimizations individually via the --fast fp16_accumulation fp8_matrix_mult cublas_ops startup flag and relaunching.
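For reference, the workaround above amounts to replacing the bare --fast flag (which now enables every optimization, including autotune) with an explicit list that omits it. Flag names are as reported in this issue; the entry point path is illustrative:

```shell
# Instead of enabling everything, including autotune:
#   python main.py --fast
# enable only the specific optimizations, leaving autotune off:
python main.py --fast fp16_accumulation fp8_matrix_mult cublas_ops
```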

Debug Logs

Total VRAM 48519 MB, total RAM 160650 MB
pytorch version: 2.8.0+cu128
xformers version: 0.0.32.post2
Enabled fp16 accumulation.
Set vram state to: HIGH_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 D : cudaMallocAsync
Using xformers attention
Python version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
ComfyUI version: 0.3.57
ComfyUI frontend version: 1.25.11
Skipping loading of custom nodes
Context impl SQLiteImpl.
Will assume non-transactional DDL.
No target revision found.
Starting server

To see the GUI go to: http://0.0.0.0:8188
To see the GUI go to: http://[::]:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type V_PREDICTION
Using xformers attention in VAE
Using xformers attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
loaded diffusion model directly to GPU
Requested to load SDXL
loaded completely 9.5367431640625e+25 4897.0483474731445 True
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
100%|█████████████████████████████████████████████████████████| 28/28 [00:29<00:00,  1.06s/it]
Requested to load AutoencoderKL
loaded completely 9.5367431640625e+25 159.55708122253418 True
 61%|██████████████████████████████████▊                      | 11/18 [00:26<00:03,  2.00it/s]
100%|█████████████████████████████████████████████████████████| 18/18 [00:28<00:00,  1.57s/it]
Prompt executed in 135.39 seconds
got prompt
100%|█████████████████████████████████████████████████████████| 28/28 [00:03<00:00,  8.54it/s]
100%|█████████████████████████████████████████████████████████| 18/18 [00:05<00:00,  3.39it/s]
Prompt executed in 12.50 seconds

Other

No response
