Conversation

@LoserCheems
Collaborator

@LoserCheems LoserCheems commented Aug 26, 2025

Fix: #121
Enhance masking logic for exact value matching, optimize memory layout for mask and bias tensors, and streamline shared memory operations to improve performance and readability. Address issues with sparsity detection and bias gradient pointer management for better computational efficiency. Cleanup includes removing unused code and adding type hints for clarity.

Updates mask condition logic to use exact equality comparison instead of less-than-or-equal, ensuring only zero values trigger masking behavior.

Removes large block of commented-out alternative implementation code to clean up the codebase and improve readability.
Changes comparison from > 0 to != 0.0f to properly detect all non-zero elements including negative values in sparse matrix operations.

Previously only positive values were considered active, which could lead to incorrect sparsity patterns when matrices contain negative elements.
Moves the bias gradient pointer advancement to the prologue section to ensure proper memory alignment and consistent pointer management throughout the computation loop.

Changes the clear out-of-bounds flag to true for improved memory safety when copying bias gradients to global memory.
Merges separate mask and bias tiled copy types into a unified implementation.

Increases alignment from 64 to 128 bits and vectorization from 4 to 8 values per operation, improving memory bandwidth utilization.
Unifies separate mask and bias global memory tiled copy objects into a single shared copy handler to reduce memory overhead and improve kernel efficiency.

Adds missing tensor partition for bias gradient computation to ensure proper memory layout handling during backward pass operations.
Removes redundant layout definitions and reuses existing layout structure to reduce code duplication and memory overhead.

Previously defined separate layout atoms and arrangements which duplicated the same configuration as the existing PdS layout.
Consolidates shared memory layout usage by replacing separate SmemLayoutMask and SmemLayoutBias with SmemLayoutPdS for both mask and bias tensors.

Removes redundant sdBias tensor and associated copy operations, streamlining memory management and reducing code duplication.

Reorganizes bias copying to occur after softcap application, improving computational flow and memory access patterns.
Eliminates kSmemdSSize variable and its usage in memory calculations to reduce shared memory footprint in backward kernel.

Comments out the unused variable definition and removes it from both kSmemSize and kSmemSize1colblock calculations, optimizing memory usage without affecting functionality.
Adjusts shared memory thresholds and kernel trait parameters across different head dimensions to improve performance on H100 and A100 GPUs.

Reduces memory requirements while maintaining or improving computational efficiency by fine-tuning block sizes, memory layout parameters, and GPU-specific optimizations.

Consolidates some GPU target categories where similar configurations work well across both H100 and A100 architectures.
Swaps memory allocation order between gradient and probability tensors to optimize memory usage pattern.

Changes sdS tensors to use bias data directly while moving sP tensor to use the previous sdS location, maintaining total memory footprint while improving data locality.
Removes commented-out debugging breakpoints that are no longer needed for development.

Adds type annotations to improve code clarity and enable better static analysis.

Fixes inconsistent return tuple lengths in backward functions to match expected parameter counts.
Contributor

Copilot AI left a comment


Pull Request Overview

This PR optimizes the FlashAttention backward kernel implementation by improving memory management and computational efficiency. The changes focus on consolidating shared memory layouts for mask and bias tensors, refining sparsity detection logic, and streamlining memory operations.

Key Changes:

  • Enhanced masking logic to use exact equality comparison (== 0.0f) instead of inequality (<= 0) for improved precision
  • Unified memory layout for mask and bias tensors to reduce shared memory usage and simplify memory management
  • Adjusted shared memory size calculations and kernel configurations to optimize for different GPU architectures

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Summary per file:

flash_dmattn/flash_dmattn_interface.py: Added type hints and removed debug code, adjusted return tuple sizes
csrc/src/utils.h: Updated sparsity detection comparison from > 0 to != 0.0f
csrc/src/mask.h: Changed mask comparison from <= 0.0f to == 0.0f and removed commented code
csrc/src/kernel_traits.h: Unified mask and bias memory layouts, removed separate layouts and updated memory size calculations
csrc/src/flash_bwd_launch_template.h: Updated shared memory thresholds and kernel configurations for different GPU architectures
csrc/src/flash_bwd_kernel.h: Streamlined shared memory operations, unified copy operations for mask/bias, and optimized memory pointer management




- def maybe_contiguous(x):
+ def maybe_contiguous(x: Optional[torch.Tensor]) -> Optional[torch.Tensor]:

Copilot AI Aug 26, 2025


The type hint references Optional, but the corresponding import is missing. Add from typing import Optional at the top of the file.

  for (int n = 0; n < size<2>(tCrM) && !local_any_active; ++n) {
      // Use direct comparison to avoid potential branching
-     local_any_active |= (tCrM(mma, m, n) > 0);
+     local_any_active |= (tCrM(mma, m, n) != 0.0f);

Copilot AI Aug 26, 2025


Using exact equality comparison with floating-point values can be problematic due to precision issues. Consider using a small epsilon for comparison: abs(tCrM(mma, m, n)) > epsilon where epsilon is a small threshold like 1e-9f.

Suggested change
- local_any_active |= (tCrM(mma, m, n) != 0.0f);
+ local_any_active |= (fabsf(tCrM(mma, m, n)) > FLASH_EPSILON);

Comment on lines 304 to +305
      // Use direct comparison to avoid potential branching
-     local_any_active |= (tCrM(mma, m, n) > 0);
+     local_any_active |= (tCrM(mma, m, n) != 0.0f);

Copilot AI Aug 26, 2025


Using exact equality comparison with floating-point values can be problematic due to precision issues. Consider using a small epsilon for comparison: abs(tCrM(mma, m, n)) > epsilon where epsilon is a small threshold like 1e-9f.

  auto coord = make_coord(make_coord(i, mi), make_coord(j, nj));
  // Apply scaling and bias or masking
- tensor(coord) = (col_idx >= col_idx_limit) || (mask(coord) <= 0.0f)
+ tensor(coord) = (col_idx >= col_idx_limit) || (mask(coord) == 0.0f)

Copilot AI Aug 26, 2025


Using exact equality comparison with floating-point values can be problematic due to precision issues. Consider using a small epsilon for comparison or ensuring the mask values are exactly 0.0f or 1.0f.

Suggested change
- tensor(coord) = (col_idx >= col_idx_limit) || (mask(coord) == 0.0f)
+ tensor(coord) = (col_idx >= col_idx_limit) || (fabsf(mask(coord)) < EPSILON)

  auto coord = make_coord(make_coord(i, mi), make_coord(j, nj));
  // Apply scaling and bias or masking
- tensor(coord) = (col_idx >= col_idx_limit) || (mask(coord) <= 0.0f)
+ tensor(coord) = (col_idx >= col_idx_limit) || (mask(coord) == 0.0f)

Copilot AI Aug 26, 2025


Using exact equality comparison with floating-point values can be problematic due to precision issues. Consider using a small epsilon for comparison or ensuring the mask values are exactly 0.0f or 1.0f.

Suggested change
- tensor(coord) = (col_idx >= col_idx_limit) || (mask(coord) == 0.0f)
+ tensor(coord) = (col_idx >= col_idx_limit) || (fabsf(mask(coord)) < EPSILON)

@LoserCheems LoserCheems merged commit 7f727ab into main Aug 26, 2025

Labels

bug Something isn't working


Development

Successfully merging this pull request may close these issues.

[BUG] NaN / Inf values appear only in dV during backward pass

6 participants