
Commit ccc0b01

Merge pull request #134 from SmallDoges/optimize-sparse-logic
Fix block size condition and enhance documentation
2 parents 064a533 + fd1dea2 commit ccc0b01

File tree: 4 files changed (+41, -15 lines)

README.md

Lines changed: 20 additions & 7 deletions
@@ -17,11 +17,12 @@ Flash-DMA is a high-performance attention implementation that integrates Flash A
 
 ## Key Features
 
-- **Sparse Attention Computation**: Dynamically selects the most important keys for each query, reducing computation from $O(N^2)$ to $O(N \cdot w)$ where $w \ll N$.
-- **Memory Efficiency**: Maintains Flash Attention's $O(N)$ memory complexity without materializing the full attention matrix.
-- **CUDA-Accelerated**: Deep integration at the CUDA kernel level with custom sparse GEMM operations for maximum performance.
-- **Long Sequence Support**: Efficiently handles sequences of 128K+ tokens through dynamic masking when sequence length exceeds `keep_window_size`.
-- **Advanced Integration**: Complete integration from Python frontend to CUDA backend with optimized memory layouts and sparse computation strategies.
+- **Dynamic Sparse Attention**: Dynamically selects the most relevant keys for each query, reducing computational complexity from $O(N^2)$ to $O(N \cdot w)$ where $w \ll N$, supporting trainable sparse patterns.
+- **Memory Efficiency**: Maintains Flash Attention's $O(N)$ memory complexity without instantiating the full attention matrix.
+- **CUDA Deep Optimization**: Utilizes custom CUDA kernels with shared memory aliasing, pipelined prefetching, and block skipping for high throughput and low memory access overhead.
+- **Extremely Long Context Support**: Handles 128K+ token sequences efficiently through dynamic mask windowing while preserving accuracy.
+- **Learnable Bias**: Built-in learnable attention bias and its gradient path dbias, eliminating the need for additional external operators.
+- **Fusion-Friendly Training**: Both forward and backward passes support block-level zero-mask skipping, further reducing computation in sparse scenarios.
 
 
 ## Performance
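
The feature list in this hunk rests on per-query top-`w` key selection and the resulting $O(N \cdot w)$ cost. Below is a minimal CPU-side sketch of that selection step, assuming a hypothetical `select_top_w_keys` helper and per-key importance scores; the repository performs this selection inside its CUDA kernels, not in host code.

```cpp
// Illustrative sketch only: pick the keep_window_size most important keys for
// one query from precomputed importance scores (cf. the ZOH states below).
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// scores: one importance value per key (N entries) for a single query.
// Returns indices of the w highest-scoring keys in O(N) average time, so
// attention over the selected keys costs O(N * w) overall instead of O(N^2).
std::vector<int> select_top_w_keys(const std::vector<float>& scores, std::size_t w) {
    std::vector<int> idx(scores.size());
    std::iota(idx.begin(), idx.end(), 0);
    if (w >= idx.size()) return idx;  // short sequence: every key stays active
    std::nth_element(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(w), idx.end(),
                     [&scores](int a, int b) { return scores[a] > scores[b]; });
    idx.resize(w);  // only these w keys participate in attention for this query
    return idx;
}
```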
@@ -129,7 +130,7 @@ The integration happens at the CUDA kernel level with several key components:
 
 - **ZOH States**: Pre-computed importance scores for key selection
 - **Active Masks**: Binary masks indicating which keys should be considered for each query
-- **Sparse Matrix Multiplication**: Custom CUDA kernels for efficient sparse attention computation
+- **Sparse Skipping**: Custom CUDA kernels for efficient sparse attention computation
 - **Block-Based Processing**: Maintains Flash Attention's block-based approach for memory efficiency
 
 This creates a hybrid attention mechanism that achieves both memory and computational efficiency for long sequences.
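
The renamed "Sparse Skipping" bullet and the "block-level zero-mask skipping" feature above amount to: if a key block's active mask is entirely zero for a query block, that tile is never computed. A hedged host-side sketch with hypothetical `KeyBlock` and `active_mask` names follows; the actual skip happens inside the CUDA kernels.

```cpp
// Sketch of block-level zero-mask skipping: a fully masked key block
// contributes nothing, so its QK^T / softmax / PV tile work is skipped.
#include <cstdint>
#include <vector>

struct KeyBlock {
    int start;  // index of the first key in this block
    int size;   // number of keys in the block (e.g. kBlockN)
};

// active_mask[k] != 0 means key k was selected for the current query block.
bool block_fully_masked(const std::vector<std::uint8_t>& active_mask, const KeyBlock& blk) {
    for (int k = blk.start; k < blk.start + blk.size; ++k) {
        if (active_mask[static_cast<std::size_t>(k)] != 0) return false;  // must compute
    }
    return true;  // nothing active in this block
}

void process_query_block(const std::vector<std::uint8_t>& active_mask,
                         const std::vector<KeyBlock>& key_blocks) {
    for (const KeyBlock& blk : key_blocks) {
        if (block_fully_masked(active_mask, blk)) continue;  // zero-mask skip
        // ... compute the (query block, key block) attention tile here ...
    }
}
```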
@@ -185,12 +186,24 @@ python benchmarks/forward_equivalence.py
 ```
 Validates numerical consistency between Python reference and CUDA implementation.
 
-### Performance Benchmarking
+### Forward Pass Performance Benchmarking
 ```bash
 python benchmarks/forward_performance.py
 ```
 Compares Flash-DMA against standard SDPA across various sequence lengths and batch sizes.
 
+### Backward Pass Equivalence
+```bash
+python benchmarks/backward_equivalence.py
+```
+Validates numerical consistency between Python reference and CUDA implementation.
+
+### Backward Pass Performance Benchmarking
+```bash
+python benchmarks/backward_performance.py
+```
+Compares Flash-DMA against standard SDPA across various sequence lengths and batch sizes.
+
 ### Gradient Computation
 ```bash
 python benchmarks/grad_equivalence.py

README_zh.md

Lines changed: 19 additions & 6 deletions
@@ -17,11 +17,12 @@ Flash-DMA is a high-performance attention implementation that combines Flash Attention's memory
 
 ## Key Features
 
-- **Sparse Attention Computation**: Dynamically selects the most important keys for each query, reducing computational complexity from $O(N^2)$ to $O(N \cdot w)$, where $w \ll N$.
+- **Dynamic Sparse Attention**: Dynamically selects the most important keys for each query, reducing computational complexity from $O(N^2)$ to $O(N \cdot w)$, where $w \ll N$, with support for trainable sparse structures.
 - **Memory Efficiency**: Maintains Flash Attention's $O(N)$ memory complexity without instantiating the full attention matrix.
-- **CUDA Accelerated**: Deep integration at the CUDA kernel level with custom sparse GEMM operations for optimal performance.
-- **Long Sequence Support**: Efficiently handles sequences of 128K+ tokens through dynamic masking when the sequence length exceeds `keep_window_size`.
-- **Advanced Integration**: Complete integration from the Python frontend to the CUDA backend, with optimized memory layouts and sparse computation strategies.
+- **Deep CUDA Optimization**: Uses custom CUDA kernels with shared memory aliasing, pipelined prefetching, and per-block skipping for high throughput and low memory access overhead.
+- **Ultra-Long Context Support**: Handles 128K+ token contexts through dynamic mask windowing while preserving accuracy.
+- **Learnable Bias**: Built-in learnable attention bias and its backward gradient path dbias, with no extra external operators required.
+- **Fusion-Friendly Training**: Both the forward and backward passes support block-level all-zero mask skipping, further reducing computation in sparse scenarios.
 
 
 ## Performance
@@ -129,7 +130,7 @@ Flash-DMA combines two complementary techniques:
 
 - **ZOH States**: Pre-computed importance scores for key selection
 - **Active Masks**: Binary masks indicating which keys each query should consider
-- **Sparse Matrix Multiplication**: Custom CUDA kernels for efficient sparse attention computation
+- **Sparse Skipping**: Custom CUDA kernels for efficient sparse attention computation
 - **Block-Based Processing**: Maintains Flash Attention's block-based approach for memory efficiency
 
 This creates a hybrid attention mechanism that achieves both memory and computational efficiency for long sequences.
@@ -184,12 +185,24 @@ python benchmarks/forward_equivalence.py
 ```
 Validates numerical consistency between the Python reference implementation and the CUDA implementation.
 
-### Performance Benchmarking
+### Forward Pass Performance Benchmarking
 ```bash
 python benchmarks/forward_performance.py
 ```
 Compares Flash-DMA with standard SDPA across various sequence lengths and batch sizes.
 
+### Backward Pass Equivalence
+```bash
+python benchmarks/backward_equivalence.py
+```
+Validates numerical consistency between the Python reference implementation and the CUDA implementation.
+
+### Backward Pass Performance Benchmarking
+```bash
+python benchmarks/backward_performance.py
+```
+Compares the performance of Flash-DMA and standard SDPA across various sequence lengths and batch sizes.
+
 ### Gradient Computation
 ```bash
 python benchmarks/grad_equivalence.py

csrc/flash_api.cpp

Lines changed: 1 addition & 1 deletion
@@ -298,7 +298,7 @@ std::tuple<at::Tensor, at::Tensor> set_params_splitkv(
 ) {
 
     // This needs to match with run_mha_fwd_splitkv_dispatch
-    const int block_n = head_size <= 64 ? 64 : (head_size < 128 ? 64 : 32);
+    const int block_n = head_size <= 64 ? 64 : (head_size <= 128 ? 64 : 32);
     const int num_n_blocks = (max_seqlen_k + block_n - 1) / block_n;
     // Technically kBlockM = 64 only for the splitKV kernels, not the standard kernel.
     // In any case we don't expect seqlen_q to be larger than 64 for inference.

csrc/src/flash_fwd_launch_template.h

Lines changed: 1 addition & 1 deletion
@@ -155,7 +155,7 @@ void run_flash_splitkv_fwd(Flash_fwd_params &params, cudaStream_t stream) {
 template<typename T, int Headdim, bool Is_causal>
 void run_mha_fwd_splitkv_dispatch(Flash_fwd_params &params, cudaStream_t stream) {
     constexpr static int kBlockM = 64; // Fixed for all head dimensions
-    constexpr static int kBlockN = Headdim <= 64 ? 64 : (Headdim < 128 ? 64 : 32);
+    constexpr static int kBlockN = Headdim <= 64 ? 64 : (Headdim <= 128 ? 64 : 32);
     run_flash_splitkv_fwd<Flash_fwd_kernel_traits<Headdim, kBlockM, kBlockN, 4, false, false, T>, Is_causal>(params, stream);
 }
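
Both one-line changes above fix the same boundary case: with `<`, a head size (or `Headdim`) of exactly 128 fell into the 32-wide branch, while the split-KV dispatch path expects 64. A small standalone check, illustrative only, with the two expressions copied from the diff:

```cpp
// Compare the block-size expression before and after this commit.
// Only head_size == 128 changes: the old form picks 32, the new form picks 64,
// keeping set_params_splitkv consistent with run_mha_fwd_splitkv_dispatch.
#include <cstdio>

int block_n_old(int head_size) {
    return head_size <= 64 ? 64 : (head_size < 128 ? 64 : 32);   // before the fix
}

int block_n_new(int head_size) {
    return head_size <= 64 ? 64 : (head_size <= 128 ? 64 : 32);  // after the fix
}

int main() {
    for (int head_size : {64, 96, 128, 256}) {
        std::printf("head_size=%3d  old block_n=%d  new block_n=%d\n",
                    head_size, block_n_old(head_size), block_n_new(head_size));
    }
    return 0;
}
```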

0 commit comments
