Fix SageAttention crash after PR #10276 fp8 weight scaling changes #10304
Open

djdarcy wants to merge 1 commit into Comfy-Org:master from
Conversation
Problem:

After PR Comfy-Org#10276 (commit 139addd) introduced convert_func/set_func for proper fp8 weight scaling during LoRA application, users with SageAttention enabled experience 100% reproducible crashes (Exception 0xC0000005 ACCESS_VIOLATION) during KSampler execution.

Root Cause:

PR Comfy-Org#10276 added fp8 weight transformations (scale up -> apply LoRA -> scale down) to fix LoRA quality with Wan 2.1/2.2 14B fp8 models. These transformations:

1. Convert weights to float32 and create copies (new memory addresses)
2. Invalidate tensor metadata that SageAttention cached
3. Break SageAttention's internal memory references
4. Cause an access violation when SageAttention tries to use the old pointers

SageAttention expects weights at their original memory addresses, with no transformations between caching and usage.

Solution:

Add a conditional bypass in LowVramPatch.__call__ that detects when SageAttention is active (via the --use-sage-attention flag) and skips the convert_func/set_func calls. This preserves SageAttention's memory reference stability while keeping the PR Comfy-Org#10276 benefits for users without SageAttention.

Trade-offs:

- When SageAttention is enabled with fp8 models + LoRAs, LoRAs are applied to the scaled weights instead of the properly rescaled weights
- Potential quality impact is unknown (no issues observed in testing)
- Only affects users who explicitly enable the SageAttention flag
- Users without SageAttention continue to benefit from PR Comfy-Org#10276

Testing Completed:

- RTX 5090, CUDA 12.8, PyTorch 2.7.0, SageAttention 2.1.1
- Wan 2.2 fp8 models with multiple LoRAs
- Crash eliminated; the ~40% SageAttention performance benefit is preserved
- No visual quality degradation observed
- Non-SageAttention workflows unaffected

Testing Requested:

- Other GPU architectures (RTX 4090, 3090, etc.)
- Different CUDA/PyTorch version combinations
- fp8 LoRA quality comparison with SageAttention enabled
- Edge cases: mixed fp8/non-fp8 workflows

Files Changed:

- comfy/model_patcher.py: LowVramPatch.__call__ method (22 insertions, 3 deletions)

Related:

- Issue: SageAttention incompatibility with fp8 weight scaling
- Original PR: Comfy-Org#10276 (fp8 LoRA quality fix for Wan models)
- SageAttention: https://github.com/thu-ml/SageAttention
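The failure mode and the proposed bypass can be sketched with plain Python stand-ins (a hypothetical illustration, not the actual comfy/model_patcher.py code: `FakeTensor`, `scale_up`, `scale_down`, and the `low_vram_patch` signature are all simplified assumptions):

```python
class FakeTensor:
    """Stand-in for a torch tensor: a data buffer plus object identity."""
    def __init__(self, data):
        self.data = list(data)

def scale_up(w, scale=4.0):
    # convert_func analogue: returns a NEW object, like an
    # out-of-place float32 conversion (new memory address).
    return FakeTensor(x * scale for x in w.data)

def scale_down(w, scale=4.0):
    # set_func analogue: again allocates a new object.
    return FakeTensor(x / scale for x in w.data)

def apply_lora(w, delta):
    # Patch mutates the existing buffer, keeping the object alive.
    w.data = [x + d for x, d in zip(w.data, delta)]
    return w

def low_vram_patch(weight, delta, sage_attention_active):
    if sage_attention_active:
        # Proposed bypass: patch in place so cached references
        # (such as those SageAttention holds) remain valid.
        # Trade-off: the LoRA lands on the still-scaled weights.
        return apply_lora(weight, delta)
    # PR #10276 path: scale up -> apply LoRA -> scale down.
    # Each scaling step allocates a new tensor object, so any
    # previously cached reference now points at stale storage.
    return scale_down(apply_lora(scale_up(weight), delta))

w = FakeTensor([1.0, 2.0])
cached = w  # reference an attention kernel might have cached
out = low_vram_patch(w, delta=[0.4, 0.8], sage_attention_active=False)
print(out is cached)    # False: cached pointer is stale -> crash risk

w2 = FakeTensor([1.0, 2.0])
cached2 = w2
out2 = low_vram_patch(w2, delta=[0.4, 0.8], sage_attention_active=True)
print(out2 is cached2)  # True: memory reference stability preserved
```

The same comparison also shows the quality trade-off noted above: on the bypass path the delta is added to the raw (still-scaled) weights, whereas the PR #10276 path applies it between the scale-up and scale-down steps.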
Member

Test Evidence Check: If this PR changes user-facing behavior, visual proof (a screen recording or screenshot) is required. PRs without applicable visual documentation may not be reviewed until provided. You can add it by: