README.md: 20 additions & 7 deletions
@@ -17,11 +17,12 @@ Flash-DMA is a high-performance attention implementation that integrates Flash A
## Key Features
-- **Sparse Attention Computation**: Dynamically selects the most important keys for each query, reducing computation from $O(N^2)$ to $O(N \cdot w)$ where $w \ll N$.
-- **Memory Efficiency**: Maintains Flash Attention's $O(N)$ memory complexity without materializing the full attention matrix.
-- **CUDA-Accelerated**: Deep integration at the CUDA kernel level with custom sparse GEMM operations for maximum performance.
-- **Long Sequence Support**: Efficiently handles sequences of 128K+ tokens through dynamic masking when sequence length exceeds `keep_window_size`.
-- **Advanced Integration**: Complete integration from Python frontend to CUDA backend with optimized memory layouts and sparse computation strategies.
+- **Dynamic Sparse Attention**: Dynamically selects the most relevant keys for each query, reducing computational complexity from $O(N^2)$ to $O(N \cdot w)$ where $w \ll N$, with support for trainable sparse patterns.
+- **Memory Efficiency**: Maintains Flash Attention's $O(N)$ memory complexity without materializing the full attention matrix.
+- **Deep CUDA Optimization**: Custom CUDA kernels with shared-memory aliasing, pipelined prefetching, and block skipping for high throughput and low memory-access overhead.
+- **Extremely Long Context Support**: Handles sequences of 128K+ tokens efficiently through dynamic mask windowing while preserving accuracy.
+- **Learnable Bias**: Built-in learnable attention bias and its gradient path (`dbias`), eliminating the need for additional external operators.
+- **Fusion-Friendly Training**: Both the forward and backward passes support block-level zero-mask skipping, further reducing computation in sparse scenarios.
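To make the dynamic sparse attention bullet concrete, here is a minimal reference sketch in plain PyTorch. It is not Flash-DMA's API: the function name `dynamic_mask_attention_reference`, the tensor layout, and the `attn_bias` argument are assumptions for illustration; only `keep_window_size` follows the README's terminology. Unlike the fused CUDA kernels, this dense reference still materializes the full $N \times N$ score matrix.

```python
# Illustrative reference only (assumed names and shapes, not Flash-DMA's API).
# Each query keeps its `keep_window_size` highest-scoring keys; all other keys
# are masked out, which is the sparsity the CUDA kernels exploit.
import torch
import torch.nn.functional as F


def dynamic_mask_attention_reference(q, k, v, attn_bias, keep_window_size=2048):
    """q, k, v: [batch, heads, seq_len, head_dim]; attn_bias broadcastable to the scores."""
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale + attn_bias  # [B, H, N, N]

    seq_len = scores.shape[-1]
    if seq_len > keep_window_size:
        # Dynamic masking only kicks in once the sequence exceeds keep_window_size.
        topk = scores.topk(keep_window_size, dim=-1).indices
        active_mask = torch.zeros_like(scores, dtype=torch.bool)
        active_mask.scatter_(-1, topk, True)
        scores = scores.masked_fill(~active_mask, float("-inf"))

    return torch.matmul(F.softmax(scores, dim=-1), v)
```

Because at most $w$ = `keep_window_size` keys per query survive the mask, the softmax and weighted-sum work per query scales with $w$ rather than $N$, which is where the $O(N \cdot w)$ figure comes from. The learnable bias enters the scores before selection, so in this sketch its gradient flows through the kept positions, mirroring the built-in `dbias` path described above.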
## Performance
@@ -129,7 +130,7 @@ The integration happens at the CUDA kernel level with several key components:
- **ZOH States**: Pre-computed importance scores for key selection
- **Active Masks**: Binary masks indicating which keys should be considered for each query
-- **Sparse Matrix Multiplication**: Custom CUDA kernels for efficient sparse attention computation
+- **Sparse Skipping**: Custom CUDA kernels that skip fully masked blocks for efficient sparse attention computation
- **Block-Based Processing**: Maintains Flash Attention's block-based approach for memory efficiency
This creates a hybrid attention mechanism that achieves both memory and computational efficiency for long sequences.
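The following is a hedged sketch, in plain PyTorch rather than CUDA, of how these components compose. The names `blockwise_masked_attention`, `active_mask`, and `block_n` are illustrative assumptions, not the library's interface; the point is only why a key block whose active-mask tile is entirely zero contributes no computation.

```python
# Hedged sketch, not the actual CUDA kernels: importance scores (the "ZOH states")
# yield a boolean active mask, and a Flash-Attention-style block loop skips any
# key block whose mask tile has no active entries at all.
import torch


def blockwise_masked_attention(q, k, v, active_mask, block_n=128):
    """Single head, float32 for brevity. q: [Nq, d], k, v: [Nk, d],
    active_mask: [Nq, Nk] bool with at least one active key per query row
    (dynamic masking guarantees this)."""
    scale = q.shape[-1] ** -0.5
    out = torch.zeros_like(q)
    # A finite initial max (instead of -inf) keeps exp(row_max - new_max) NaN-free
    # for rows whose first few key blocks are entirely masked.
    row_max = torch.full((q.shape[0], 1), -1e9, dtype=q.dtype, device=q.device)
    row_sum = torch.zeros(q.shape[0], 1, dtype=q.dtype, device=q.device)

    for start in range(0, k.shape[0], block_n):
        tile = active_mask[:, start:start + block_n]
        if not tile.any():
            continue  # block skipping: a fully masked key block costs no GEMM at all
        s = (q @ k[start:start + block_n].T) * scale
        s = s.masked_fill(~tile, float("-inf"))
        # Streaming (online) softmax update, as in Flash Attention.
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)
        p = torch.exp(s - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v[start:start + block_n]
        row_max = new_max
    return out / row_sum
```

The real implementation fuses this loop into block-based CUDA kernels (the "Block-Based Processing" and "Sparse Skipping" components above); the sketch only demonstrates the accounting that lets fully masked blocks be skipped in both the forward and backward passes.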