Fix SageAttention crash after PR #10276 fp8 weight scaling changes #10304
Open

djdarcy wants to merge 1 commit into Comfy-Org:master from
Conversation
Problem:

After PR Comfy-Org#10276 (commit 139addd) introduced convert_func/set_func for proper fp8 weight scaling during LoRA application, users with SageAttention enabled experience 100% reproducible crashes (Exception 0xC0000005 ACCESS_VIOLATION) during KSampler execution.

Root Cause:

PR Comfy-Org#10276 added fp8 weight transformations (scale up -> apply LoRA -> scale down) to fix LoRA quality with Wan 2.1/2.2 14B fp8 models. These transformations:

1. Convert weights to float32 and create copies (new memory addresses)
2. Invalidate tensor metadata that SageAttention cached
3. Break SageAttention's internal memory references
4. Cause an access violation when SageAttention tries to use the old pointers

SageAttention expects weights at their original memory addresses, with no transformations between caching and usage.

Solution:

Add a conditional bypass in LowVramPatch.__call__ that detects when SageAttention is active (via the --use-sage-attention flag) and skips the convert_func/set_func calls. This preserves SageAttention's memory reference stability while keeping the PR Comfy-Org#10276 benefits for users without SageAttention.

Trade-offs:

- When SageAttention is enabled with fp8 models + LoRAs, LoRAs are applied to the scaled weights instead of the properly rescaled weights
- Potential quality impact is unknown (no issues observed in testing)
- Only affects users who explicitly enable the SageAttention flag
- Users without SageAttention continue to benefit from PR Comfy-Org#10276

Testing Completed:

- RTX 5090, CUDA 12.8, PyTorch 2.7.0, SageAttention 2.1.1
- Wan 2.2 fp8 models with multiple LoRAs
- Crash eliminated; the ~40% SageAttention performance benefit is preserved
- No visual quality degradation observed
- Non-SageAttention workflows unaffected

Testing Requested:

- Other GPU architectures (RTX 4090, 3090, etc.)
- Different CUDA/PyTorch version combinations
- fp8 LoRA quality comparison with SageAttention enabled
- Edge cases: mixed fp8/non-fp8 workflows

Files Changed:

- comfy/model_patcher.py: LowVramPatch.__call__ method (22 insertions, 3 deletions)

Related:

- Issue: SageAttention incompatibility with fp8 weight scaling
- Original PR: Comfy-Org#10276 (fp8 LoRA quality fix for Wan models)
- SageAttention: https://github.com/thu-ml/SageAttention
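The failure mode and the proposed bypass can be sketched with plain Python stand-ins (a hypothetical illustration, not the actual comfy/model_patcher.py code: `FakeTensor`, `scale_up`, `scale_down`, and the `low_vram_patch` signature are all simplified assumptions):

```python
class FakeTensor:
    """Stand-in for a torch tensor: a data buffer plus object identity."""
    def __init__(self, data):
        self.data = list(data)

def scale_up(w, scale=4.0):
    # convert_func analogue: returns a NEW object, like an
    # out-of-place float32 conversion (new memory address).
    return FakeTensor(x * scale for x in w.data)

def scale_down(w, scale=4.0):
    # set_func analogue: again allocates a new object.
    return FakeTensor(x / scale for x in w.data)

def apply_lora(w, delta):
    # Patch mutates the existing buffer, keeping the object alive.
    w.data = [x + d for x, d in zip(w.data, delta)]
    return w

def low_vram_patch(weight, delta, sage_attention_active):
    if sage_attention_active:
        # Proposed bypass: patch in place so cached references
        # (such as those SageAttention holds) remain valid.
        # Trade-off: the LoRA lands on the still-scaled weights.
        return apply_lora(weight, delta)
    # PR #10276 path: scale up -> apply LoRA -> scale down.
    # Each scaling step allocates a new tensor object, so any
    # previously cached reference now points at stale storage.
    return scale_down(apply_lora(scale_up(weight), delta))

w = FakeTensor([1.0, 2.0])
cached = w  # reference an attention kernel might have cached
out = low_vram_patch(w, delta=[0.4, 0.8], sage_attention_active=False)
print(out is cached)    # False: cached pointer is stale -> crash risk

w2 = FakeTensor([1.0, 2.0])
cached2 = w2
out2 = low_vram_patch(w2, delta=[0.4, 0.8], sage_attention_active=True)
print(out2 is cached2)  # True: memory reference stability preserved
```

The same comparison also shows the quality trade-off noted above: on the bypass path the delta is added to the raw (still-scaled) weights, whereas the PR #10276 path applies it between the scale-up and scale-down steps.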
Member

Test Evidence Check: If this PR changes user-facing behavior, visual proof (a screen recording or screenshot) is required. PRs without applicable visual documentation may not be reviewed until provided. You can add it by: