Optimize Flash Dynamic Mask Attention Kernel Configurations #148
Description
This PR fixes issue #132 by simplifying and unifying the kernel launch configurations.
Overall: simpler launch logic, improved large-sequence throughput, and lower maintenance complexity.
Type of Change
Related Issues
Changes Made
Code / Kernel Launch Logic
kBlockM = kBlockN = 64 (see the launch sketch below).
__grid_constant__ usage when arch ≥ sm80.
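As illustration, a minimal sketch of the intended launch shape, assuming a hypothetical AttnParams struct, kernel name (attn_fwd_kernel), and CTA size; the only parts taken from this PR are the single 64×64 tile constants and the __grid_constant__ qualifier gated on sm80:

```cuda
#include <cuda_runtime.h>

// Hypothetical parameter struct; the real kernels pass their own params type.
struct AttnParams {
    const void *q, *k, *v, *mask;
    void *o;
    int seqlen_q, seqlen_k, head_dim, num_heads, batch;
};

// Unified tile shape: a single 64x64 configuration instead of per-head-dim tiles.
constexpr int kBlockM = 64;  // query rows handled by one CTA
constexpr int kBlockN = 64;  // key/value columns processed per inner iteration

// On sm80+ the params struct can be passed as a __grid_constant__ kernel
// argument (read-only, placed in constant memory); older archs fall back to a
// plain by-value parameter.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
#define PARAM_QUALIFIER __grid_constant__
#else
#define PARAM_QUALIFIER
#endif

__global__ void attn_fwd_kernel(PARAM_QUALIFIER const AttnParams params) {
    // ... iterate over kBlockN-wide key/value tiles for this CTA's kBlockM query rows ...
}

void launch_attn_fwd(const AttnParams &params, cudaStream_t stream) {
    // One CTA per kBlockM query rows, for every (head, batch) pair.
    dim3 grid((params.seqlen_q + kBlockM - 1) / kBlockM, params.num_heads, params.batch);
    dim3 block(128);  // hypothetical CTA size
    attn_fwd_kernel<<<grid, block, 0, stream>>>(params);
}
```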
Documentation / Comments
Testing
Functional validation (numerical equivalence) was performed with the existing benchmark / equivalence scripts for head dims {32, 64, 96, 128, 192, 256} and sequence lengths up to 32K (including the extreme K-only scaling cases). Gradients match the SDPA reference within FP16/BF16 tolerance.
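For reference, a minimal sketch of the kind of allclose-style check such equivalence scripts perform; the tolerance values and helper name here are illustrative assumptions, not the repository's actual thresholds:

```cuda
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative tolerances for FP16-scale accumulation differences; the real
// scripts may use dtype- and shape-dependent thresholds.
constexpr float kAbsTol = 1e-2f;
constexpr float kRelTol = 1e-2f;

// Returns true if every element of `out` matches `ref` within a mixed
// absolute/relative tolerance, reporting the first mismatch and the max error.
bool outputs_match(const std::vector<float> &out, const std::vector<float> &ref) {
    if (out.size() != ref.size()) return false;
    float max_abs_diff = 0.f;
    for (size_t i = 0; i < out.size(); ++i) {
        float diff = std::fabs(out[i] - ref[i]);
        max_abs_diff = std::max(max_abs_diff, diff);
        if (diff > kAbsTol + kRelTol * std::fabs(ref[i])) {
            std::printf("mismatch at %zu: %f vs %f\n", i, out[i], ref[i]);
            return false;
        }
    }
    std::printf("max abs diff: %g\n", max_abs_diff);
    return true;
}
```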
Register and shared-memory usage verified with nvcc --ptxas-options=-v.
Test Configuration (example)
Performance Impact
Summary
The forward pass sees consistent latency reductions for medium-to-large sequence lengths at head dim 32 (up to ~27% at 2K–4K tokens; 11–18% at very large lengths), while very small K-extreme cases remain memory-dominated and show neutral or expectedly low utilization (unchanged behavior). Head dim 64+ likewise exhibits strong speedups or maintains parity with the previous best variant. The backward pass achieves large speedups vs SDPA (6–11× for long sequences) with stable scaling; the D64 variant is slightly slower than SDPA for the tiniest shapes (expected, due to launch overhead) but overtakes quickly as sequence length grows.
Forward (Head Dim 32) – Old (128×64 tile) vs New (64×64 tile)
Average reduction (geometric mean, excluding the tiny 256/512 lengths): ~19.5%.
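For context, a hedged sketch of how per-shape latencies and the geometric-mean reduction can be computed; this is a generic CUDA-event timing loop, not the benchmark script used for the numbers above:

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <functional>
#include <vector>

// Time a launch callable with CUDA events, averaging over `iters` runs after a
// short warmup.
float time_kernel_ms(const std::function<void()> &launch, int iters = 50) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    for (int i = 0; i < 5; ++i) launch();   // warmup
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / iters;
}

// Geometric-mean latency reduction across sequence lengths:
// 1 - exp(mean(log(new_ms / old_ms))).
float geomean_reduction(const std::vector<float> &old_ms, const std::vector<float> &new_ms) {
    if (old_ms.empty() || old_ms.size() != new_ms.size()) return 0.f;
    double log_sum = 0.0;
    for (size_t i = 0; i < old_ms.size(); ++i)
        log_sum += std::log(new_ms[i] / old_ms[i]);
    return 1.f - static_cast<float>(std::exp(log_sum / old_ms.size()));
}
```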
Forward (Head Dim 64) – New Kernel (64×64)
Representative large‑sequence speedups vs SDPA:
Forward (Head Dim 96) – Two Candidate Configs (80 KB vs 52 KB Variant)
Both are 64×64-based; the lower shared-memory variant (52 KB) improved large-sequence throughput (e.g., 32K tokens: 7.93 ms → 7.23 ms, ~9%).
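A sketch of why the two candidates need different launch handling: both exceed the 48 KB default shared-memory limit, so the launch opts in via cudaFuncSetAttribute. The per-tile byte breakdown below is an illustrative assumption chosen to land near 52 KB / 80 KB, not the kernel's actual layout:

```cuda
#include <cuda_runtime.h>

constexpr int kBlockM = 64;
constexpr int kBlockN = 64;
constexpr int kHeadDim = 96;
constexpr size_t kHalf = 2;  // bytes per fp16/bf16 element

// Illustrative shared-memory budget per CTA (assumed breakdown, not the real layout):
constexpr size_t kSmemQ    = kBlockM * kHeadDim * kHalf;       // 12 KB query tile
constexpr size_t kSmemKV   = 2 * kBlockN * kHeadDim * kHalf;   // 24 KB key + value tiles
constexpr size_t kSmemMask = 2 * kBlockM * kBlockN * kHalf;    // 16 KB mask + bias tiles
constexpr size_t kSmemLean = kSmemQ + kSmemKV + kSmemMask;     // ~52 KB variant
constexpr size_t kSmemFull = kSmemLean + 28 * 1024;            // ~80 KB with extra staging (hypothetical)

__global__ void attn_fwd_d96_kernel() { /* ... */ }

cudaError_t launch_d96(bool lean_variant, dim3 grid, dim3 block, cudaStream_t stream) {
    size_t smem_bytes = lean_variant ? kSmemLean : kSmemFull;
    // Anything above the 48 KB default requires an explicit opt-in.
    if (smem_bytes > 48 * 1024) {
        cudaError_t err = cudaFuncSetAttribute(
            attn_fwd_d96_kernel,
            cudaFuncAttributeMaxDynamicSharedMemorySize,
            static_cast<int>(smem_bytes));
        if (err != cudaSuccess) return err;
    }
    attn_fwd_d96_kernel<<<grid, block, smem_bytes, stream>>>();
    return cudaGetLastError();
}
```

Keeping the dynamic request at ~52 KB also leaves more headroom per SM for concurrent CTAs, which is one plausible reason the leaner variant wins at large sequence lengths.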
Long-Context & Windowed Variants
Window sizes (W) from 32 → 32768 maintain high relative speedups (20–50×) for large sequence lengths, confirming robust behavior across sliding window scenarios.
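To make the windowed behavior concrete, a hedged sketch of how a window of size W bounds the range of key blocks each query block visits (assuming causal masking and aligned query/key indexing, which is an assumption rather than the kernel's exact convention):

```cuda
#include <algorithm>

constexpr int kBlockM = 64;
constexpr int kBlockN = 64;

// For query rows [m_block*kBlockM, (m_block+1)*kBlockM) with a causal sliding
// window of size W, only keys in [row - W + 1, row] are visible. The CTA then
// iterates only over key blocks intersecting that range, which is why runtime
// scales with W rather than with seqlen_k.
struct KeyBlockRange { int n_block_min; int n_block_max; };  // half-open [min, max)

inline KeyBlockRange visible_key_blocks(int m_block, int window_size, int seqlen_k) {
    int row_lo = m_block * kBlockM;              // first query row in this CTA
    int row_hi = row_lo + kBlockM - 1;           // last query row in this CTA
    int key_lo = std::max(0, row_lo - window_size + 1);
    int key_hi = std::min(seqlen_k - 1, row_hi); // causal upper bound
    return { key_lo / kBlockN, key_hi / kBlockN + 1 };
}
```

Because the visited block count stays roughly W / kBlockN per query block, the relative speedup over a full-attention SDPA baseline grows with sequence length, consistent with the 20–50× figures above.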
Backward (Head Dim 32) – CUDA vs SDPA
Selected points:
Backward (Head Dim 64)
Smaller shapes (seq len 256) show SDPA marginally faster (expected launch-overhead dominance), but the crossover occurs quickly as sequence length grows.
Notes
Breaking Changes
None.
Checklist
Additional tests to be added under tests/ later.
CUDA-specific
Performance Data Sources
Additional Notes