
Massive hangups at each step of the workflow on the first run, after autotune was enabled by default with the --fast flag #9779

@whythisusername

Description


Custom Node Testing

Expected Behavior

~15 seconds of overhead on the first run

Actual Behavior

123 seconds of overhead on a simple SDXL workflow: the first run executes in 135.39 seconds, the second in only 12.50 seconds. And it goes beyond the core functions: the more nodes you add, the more overhead accumulates. It is especially bad with detailer custom nodes; with just one FaceDetailer, first-run execution time rises to 404.44 seconds.

Steps to Reproduce

Tested on a 2-pass workflow with upscale, which ended up with 123 s of overhead. The only thing that helps is explicitly disabling autotune by passing the other optimizations individually via the --fast fp16_accumulation fp8_matrix_mult cublas_ops startup flag and relaunching.
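For reference, the workaround above amounts to replacing the bare --fast flag (which now enables every optimization, including autotune) with an explicit list that omits it. Flag names are as reported in this issue; the entry point path is illustrative:

```shell
# Instead of enabling everything, including autotune:
#   python main.py --fast
# enable only the specific optimizations, leaving autotune off:
python main.py --fast fp16_accumulation fp8_matrix_mult cublas_ops
```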

Debug Logs

Total VRAM 48519 MB, total RAM 160650 MB
pytorch version: 2.8.0+cu128
xformers version: 0.0.32.post2
Enabled fp16 accumulation.
Set vram state to: HIGH_VRAM
Device: cuda:0 NVIDIA GeForce RTX 4090 D : cudaMallocAsync
Using xformers attention
Python version: 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0]
ComfyUI version: 0.3.57
ComfyUI frontend version: 1.25.11
Skipping loading of custom nodes
Context impl SQLiteImpl.
Will assume non-transactional DDL.
No target revision found.
Starting server

To see the GUI go to: http://0.0.0.0:8188
To see the GUI go to: http://[::]:8188
got prompt
model weight dtype torch.float16, manual cast: None
model_type V_PREDICTION
Using xformers attention in VAE
Using xformers attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
loaded diffusion model directly to GPU
Requested to load SDXL
loaded completely 9.5367431640625e+25 4897.0483474731445 True
Requested to load SDXLClipModel
loaded completely 9.5367431640625e+25 1560.802734375 True
100%|█████████████████████████████████████████████████████████| 28/28 [00:29<00:00,  1.06s/it]
Requested to load AutoencoderKL
loaded completely 9.5367431640625e+25 159.55708122253418 True
 61%|██████████████████████████████████▊                      | 11/18 [00:26<00:03,  2.00it/s]
100%|█████████████████████████████████████████████████████████| 18/18 [00:28<00:00,  1.57s/it]
Prompt executed in 135.39 seconds
got prompt
100%|█████████████████████████████████████████████████████████| 28/28 [00:03<00:00,  8.54it/s]
100%|█████████████████████████████████████████████████████████| 18/18 [00:05<00:00,  3.39it/s]
Prompt executed in 12.50 seconds

Other

No response
