Consolidate mask and bias memory operations #129
Conversation
Replaces separate SmemLayoutAtomMask and SmemLayoutAtomBias with a single SmemLayoutAtomPS to reduce code duplication and improve memory layout consistency. Introduces a kPBlockN parameter with configurable block sizes (16, 32, or 64) and a dedicated swizzle pattern for better memory access patterns. Unifies the global memory copy operations for mask and bias into a single GmemTiledCopyMaskBias with 128-bit alignment and 8 values per read.
Consolidates separate mask and bias memory copy operations into a unified approach by replacing distinct layout types and copy operations with shared AtomPS layout and MaskBias copy operations. Reduces code duplication and improves memory access patterns by using the same layout configuration for both mask and bias tensors in shared memory operations.
Eliminates the kPBlockN constant and its static assertion since it was not being used effectively in the layout configuration. Simplifies the SmemLayoutAtomPS composition by directly using kBlockN instead of the intermediate kPBlockN parameter.
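For illustration, a minimal CuTe-style sketch of what the consolidated types could look like after this change, assuming 16-bit elements and flash-attention-style trait parameters; the wrapper struct, swizzle parameters, and thread counts are illustrative, not the repository's exact definitions:

```cpp
#include <cute/tensor.hpp>
#include <cutlass/numeric_types.h>

using namespace cute;

template <int kBlockM, int kBlockN, int kNThreads, typename Element = cutlass::half_t>
struct MaskBiasTraitsSketch {
    // One swizzled shared-memory atom used for both the mask and the bias tile,
    // composed directly on kBlockN (no intermediate kPBlockN parameter).
    using SmemLayoutAtomPS = decltype(composition(
        Swizzle<3, 3, 3>{},
        Layout<Shape<Int<8>, Int<kBlockN>>, Stride<Int<kBlockN>, _1>>{}));
    // Full tiled layout for a (kBlockM, kBlockN) mask/bias tile in shared memory.
    using SmemLayoutPS = decltype(tile_to_shape(
        SmemLayoutAtomPS{}, Shape<Int<kBlockM>, Int<kBlockN>>{}));

    // One global-memory tiled copy shared by mask and bias: 128-bit (16-byte)
    // transactions, i.e. 8 values per read for half/bfloat16 elements.
    static constexpr int kGmemElemsPerLoad = 16 / sizeof(Element);
    static constexpr int kGmemThreadsPerRow = kBlockN / kGmemElemsPerLoad;
    using GmemLayoutAtomPS = Layout<
        Shape<Int<kNThreads / kGmemThreadsPerRow>, Int<kGmemThreadsPerRow>>,
        Stride<Int<kGmemThreadsPerRow>, _1>>;
    using GmemTiledCopyMaskBias = decltype(make_tiled_copy(
        Copy_Atom<UniversalCopy<cute::uint128_t>, Element>{},
        GmemLayoutAtomPS{},
        Layout<Shape<_1, Int<kGmemElemsPerLoad>>>{}));  // each thread copies 8 contiguous values
};
```

With kBlockN = 64 and half-precision elements this yields the 8-values-per-128-bit read mentioned above, and the mask and bias tiles reuse the same swizzled atom.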
Improves kernel selection logic by dynamically querying device shared memory limits instead of relying on hardcoded compute capability checks. Uses actual shared memory per block availability to determine optimal kernel configurations, enabling better performance across different GPU architectures without requiring architecture-specific branching logic. Simplifies the codebase by removing compute capability detection and associated conditional logic while maintaining performance optimization goals.
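As a rough sketch of that selection logic (the threshold and kernel-runner names below are placeholders, not the repository's actual values), the opt-in shared-memory limit can be queried through the CUDA runtime at launch time:

```cpp
#include <cuda_runtime.h>

// Choose a kernel configuration from the device's actual opt-in shared-memory limit
// rather than from its compute capability.
void run_mha_fwd_sketch(int device_id) {
    int max_smem_per_block = 0;
    cudaDeviceGetAttribute(&max_smem_per_block,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device_id);

    constexpr int kSmemLargeTile = 112 * 1024;  // assumed requirement of the larger tile config
    if (max_smem_per_block >= kSmemLargeTile) {
        // Enough dynamic shared memory: launch the larger-tile kernel (placeholder).
        // run_flash_fwd<Kernel_traits_large>(params, stream);
    } else {
        // Fall back to a configuration that fits the available shared memory (placeholder).
        // run_flash_fwd<Kernel_traits_small>(params, stream);
    }
}
```

The same query works on any architecture, which is what removes the need for explicit compute-capability branching.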
Fixes dynamic mask preparation to properly handle invalid topk values by checking against minimum dtype values before scattering to attention mask. Expands benchmark test configurations to include comprehensive coverage across multiple head dimensions (32, 64, 96, 128, 256) and sequence lengths, providing more thorough validation of attention mechanisms. Re-enables previously disabled triton and flex attention test suites to ensure complete equivalence testing across all implementation variants.
Increases block size from 64 to 128 for head dimensions <= 32 to improve memory throughput and computational efficiency for smaller attention heads. The nested ternary operator now handles three cases (see the sketch below):
- Head dim <= 32: uses a 128 block size
- Head dim <= 64: uses a 64 block size
- Head dim >= 128: uses a 32 block size
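A minimal sketch of that selection (the function name is illustrative; in the launch template it is written as a nested ternary):

```cpp
// Block-size choice mirroring the three cases listed above.
constexpr int block_size_for_headdim(int headdim) {
    return headdim <= 32 ? 128   // small heads: larger blocks for better throughput
         : headdim <= 64 ? 64    // mid-size heads keep the previous 64 block size
         : 32;                   // larger head dimensions (128, 256) use 32
}
```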
Pull Request Overview
This PR consolidates mask and bias memory operations to unify their layouts and reduce code duplication. The changes improve memory access patterns through configurable block sizes and optimize kernel selection logic for better performance across GPU architectures.
- Consolidate mask and bias memory layouts into a unified "PS" (Pointwise-Sparse) layout system
- Replace architecture-based kernel selection with dynamic shared memory allocation checking
- Enhance test coverage with expanded configurations and re-enable previously disabled tests
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| csrc/src/kernel_traits.h | Unifies mask and bias layouts into SmemLayoutAtomPS, consolidates memory copy operations |
| csrc/src/flash_fwd_launch_template.h | Replaces compute capability checks with shared memory size checks for kernel selection |
| csrc/src/flash_fwd_kernel.h | Updates kernel code to use unified mask/bias copy operations and layout types |
| benchmarks/forward_equivalence.py | Expands test coverage with additional head dimensions and re-enables triton/flex tests |
Comments suppressed due to low confidence (4)
csrc/src/kernel_traits.h:221
- Using SmemLayoutAtomPS instead of SmemLayoutPS for mask tensor creation. This creates a layout mismatch, since SmemLayoutAtomPS is the atomic layout while SmemLayoutPS is the full tiled layout that should be used for tensor creation.

csrc/src/kernel_traits.h:225
- Using SmemLayoutAtomPS instead of SmemLayoutPS for bias tensor creation. This creates a layout mismatch, since SmemLayoutAtomPS is the atomic layout while SmemLayoutPS is the full tiled layout that should be used for tensor creation.
Context (test_configs):
(1, 2, 1, 4096, 4096, 96, False),
(1, 2, 1, 128, 128, 128, True),
(1, 2, 1, 128, 128, 128, True),
Copilot AI commented on Aug 26, 2025:
Duplicate test configuration. The same test case (1, 2, 1, 128, 128, 128, True) appears twice in the test_configs list, which is redundant and increases test execution time unnecessarily.
Suggested change: keep only one (1, 2, 1, 128, 128, 128, True), entry.
Context (test_configs, triton test function):
(1, 2, 1, 4096, 4096, 96, False),
(1, 2, 1, 128, 128, 128, True),
(1, 2, 1, 128, 128, 128, True),
Copilot AI commented on Aug 26, 2025:
Duplicate test configuration in triton test function. The same test case (1, 2, 1, 128, 128, 128, True) appears twice in the test_configs list, which is redundant and increases test execution time unnecessarily.
Suggested change: keep only one (1, 2, 1, 128, 128, 128, True), entry.
Context (test_configs, flex test function):
(1, 2, 1, 4096, 4096, 96, False),
(1, 2, 1, 128, 128, 128, True),
(1, 2, 1, 128, 128, 128, True),
Copilot AI commented on Aug 26, 2025:
Duplicate test configuration in flex test function. The same test case (1, 2, 1, 128, 128, 128, True) appears twice in the test_configs list, which is redundant and increases test execution time unnecessarily.
Suggested change: keep only one (1, 2, 1, 128, 128, 128, True), entry.
Context (test_configs, flex test function):
(1, 2, 1, 128, 128, 128, True),
(1, 2, 1, 128, 128, 128, False),
Copilot AI commented on Aug 26, 2025:
Duplicate test configuration in flex test function. The test case (1, 2, 1, 128, 128, 128, True) appears again, and there's also a duplicate (1, 2, 1, 128, 128, 128, False) that should be removed to avoid redundant testing.
Suggested change: comment out the duplicated entries:
# (1, 2, 1, 128, 128, 128, True), # Removed duplicate
# (1, 2, 1, 128, 128, 128, False), # Removed duplicate
Enhances the prepare_dynamic_mask function by capturing both values and indices from torch.topk operation, then filtering out invalid entries based on minimum dtype values. This prevents invalid indices from being included in the attention mask. Updates inline comments to standardize "INF" terminology and removes outdated debugging comments from test cases.
Improves the dynamic mask preparation by properly handling cases where topk values are invalid (equal to minimum dtype value). Previously, the mask would incorrectly include positions with invalid attention scores, potentially causing incorrect attention computations. Now validates topk values before setting mask positions, ensuring only valid attention scores are preserved in the final mask.
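A small libtorch sketch of that validity check (the repository's prepare_dynamic_mask is Python; the function name, float dtype, and mask shape here are illustrative assumptions):

```cpp
#include <torch/torch.h>
#include <limits>

// Build a keep-mask from the top-k attention scores, treating entries equal to the
// dtype minimum as invalid so they are never scattered into the mask.
torch::Tensor topk_keep_mask_sketch(const torch::Tensor& scores, int64_t keep_window) {
    const float min_val = std::numeric_limits<float>::lowest();   // "minimum dtype value" for float
    auto [topk_vals, topk_idx] = scores.topk(keep_window, /*dim=*/-1);
    auto valid = topk_vals.ne(min_val);                  // false where the score is only the filler value
    auto mask = torch::zeros_like(scores, torch::kBool); // start with everything masked out
    mask.scatter_(-1, topk_idx, valid);                  // invalid top-k positions stay false
    return mask;
}
```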
Unify mask and bias layouts and memory operations to reduce code duplication and improve memory access patterns. Introduce configurable block sizes and optimize kernel selection logic for better performance across GPU architectures. Enhance attention mask handling and expand test coverage for comprehensive validation. Adjust block sizes for improved efficiency with smaller head dimensions.