🎉 Flash-DMA v0.1.0
We're excited to announce the first official release of Flash-DMA (Flash Dynamic Mask Attention)!
🚀 What is Flash-DMA?
Flash-DMA is a high-performance attention implementation that combines:
- Flash Attention's memory efficiency
- Dynamic Mask Attention's sparse computation
- Support for extremely long sequences (128K+ tokens)
✨ Key Features
🔥 Performance
- Sparse Attention: Reduces computation from O(N²) to O(N·w) where w ≪ N
- Memory Efficient: Maintains O(N) memory complexity
- CUDA Accelerated: Custom sparse GEMM operations at kernel level
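To make the O(N·w) claim concrete, here is a minimal plain-PyTorch sketch of attending to only w dynamically selected keys per query. It illustrates the idea only, not the library's fused CUDA kernel; the `importance` scores and the function name are placeholders.

```python
# Conceptual sketch only (plain PyTorch, not the fused CUDA kernel):
# each query attends to a fixed budget of w keys chosen by a dynamic mask,
# so the score matrix shrinks from N x N to N x w.
import torch

def sparse_attention_reference(q, k, v, importance, w):
    """q, k, v: [N, d]; importance: [N, N] relevance scores; w: keys kept per query."""
    d = q.shape[-1]
    # Dynamic mask: keep the top-w keys for every query.
    topk_idx = importance.topk(w, dim=-1).indices            # [N, w]
    k_sel = k[topk_idx]                                       # [N, w, d]
    v_sel = v[topk_idx]                                       # [N, w, d]
    # Scores are computed only against the selected keys: O(N * w * d).
    scores = torch.einsum("nd,nwd->nw", q, k_sel) / d ** 0.5  # [N, w]
    probs = scores.softmax(dim=-1)
    return torch.einsum("nw,nwd->nd", probs, v_sel)           # [N, d]
```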
🛠️ Multiple Backends
- CUDA Backend: Maximum performance with custom kernels
- Triton Backend: Flexibility for research and development
- Flex Backend: Integration with Transformers library
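The backends listed above sit behind a single entry point. The following is only a hypothetical sketch of such a fallback chain, not the package's actual selection logic; the function and backend names are assumptions.

```python
# Hypothetical dispatch sketch: the real API for choosing a backend may differ.
import torch

def pick_backend(prefer: str = "cuda") -> str:
    available = []
    if torch.cuda.is_available():
        available.append("cuda")    # custom CUDA kernels, fastest path
    try:
        import triton  # noqa: F401
        available.append("triton")  # research-friendly Triton kernels
    except ImportError:
        pass
    available.append("flex")        # FlexAttention integration for Transformers
    return prefer if prefer in available else available[0]
```

The intent mirrors the feature list: prefer the CUDA kernels when a compatible GPU is present, with Triton and Flex as progressively more portable fallbacks.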
📏 Long Sequence Support
- Efficiently handles sequences of 128K+ tokens
- Dynamic masking when sequence length exceeds `keep_window_size` (sketched below)
- Optimized memory layouts for large-scale processing
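A minimal sketch of that masking rule, assuming `keep_window_size` behaves as described in the list above (plain PyTorch, illustrative only; the attention-bias input and helper name are placeholders):

```python
# Sketch of the dynamic-masking rule: dense attention for short sequences,
# top-k selection per query once the key length exceeds keep_window_size.
import torch

def dynamic_mask(attn_bias: torch.Tensor, keep_window_size: int) -> torch.Tensor:
    """attn_bias: [..., N_q, N_k] relevance scores. Returns a boolean keep-mask."""
    n_k = attn_bias.shape[-1]
    if n_k <= keep_window_size:
        # Short sequences: no sparsification, attend to every key.
        return torch.ones_like(attn_bias, dtype=torch.bool)
    # Long sequences: each query keeps only its keep_window_size highest-scoring keys.
    topk = attn_bias.topk(keep_window_size, dim=-1).indices
    mask = torch.zeros_like(attn_bias, dtype=torch.bool)
    return mask.scatter(-1, topk, True)
```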
📦 Installation
Prerequisites
- Python 3.9+
- PyTorch 2.0+
- CUDA 11.8+
- NVIDIA GPU with Compute Capability 8.0+
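A quick way to check these prerequisites from Python, using only standard PyTorch calls:

```python
# Verify the environment against the prerequisites above.
import sys
import torch

assert sys.version_info >= (3, 9), "Python 3.9+ required"
assert tuple(int(x) for x in torch.__version__.split(".")[:2]) >= (2, 0), "PyTorch 2.0+ required"
assert torch.cuda.is_available(), "a CUDA-capable NVIDIA GPU is required"
print("CUDA toolkit used by PyTorch:", torch.version.cuda)            # should be 11.8 or newer
print("GPU compute capability:", torch.cuda.get_device_capability())  # should be (8, 0) or higher
```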
Install from Source
git clone https://github.com/SmallDoges/flash-dmattn.git
cd flash-dmattn
git submodule update --init --recursive
pip install .
What's Changed
- Workspace by @LoserCheems in #1
- Add namespace_config to csrc by @LoserCheems in #2
- Add hardware_info to csrc by @LoserCheems in #3
- Add block_info to csrc by @LoserCheems in #4
- Add flash params to csrc by @LoserCheems in #5
- Workspace by @LoserCheems in #6
- Update global to Shared Memory operation by @LoserCheems in #7
- Update PREDICATES by @LoserCheems in #8
- Fix some nits for layout by @LoserCheems in #9
- Workspace by @LoserCheems in #10
- Fix Dynamic Mask Attention Integration in FlashAttention CUDA Kernel by @Copilot in #12
- Fix dynamic mask attention equivalence issue between Python and CUDA by @Copilot in #14
- Fix CUDA dynamic mask attention scaling to match Python implementation by @Copilot in #16
- Update mask.h by @Evanwu1125 in #17
- Comprehensive README improvement with installation, usage examples, and documentation by @Copilot in #19
- Adds no-topk variant for kernel performance analysis by @LoserCheems in #20
- Optimize sparse GEMM and enable in attention computation by @LoserCheems in #21
- Adds row stride support to offset calculation methods by @LoserCheems in #22
- Corrects ZOH tensor dimension comment by @LoserCheems in #23
- Adds rotary positional encoding operations by @LoserCheems in #24
- Adds conditional softcap switch macro by @LoserCheems in #25
- Removes unused template parameter from DynamicMask by @LoserCheems in #26
- Updates tensor offset calculations and formatting by @LoserCheems in #27
- Adds split-K attention kernel with sparse computation by @LoserCheems in #28
- Adds dropout support and softcap feature to flash attention by @LoserCheems in #29
- Add specialized CUDA kernels for multi-head attention with various head dimensions by @LoserCheems in #30
- Remove cub submodule and add cutlass; implement FlashDynamicMaskAttention by @LoserCheems in #31
- Refactors setup.py for production-ready package build by @LoserCheems in #32
- Update integration by @LoserCheems in #33
- Adds comprehensive API reference documentation by @LoserCheems in #34
- Updates README with improved technical accuracy and examples by @LoserCheems in #35
- Fix bug by @LoserCheems in #36
- Adds column stride parameters to ZOH_params struct by @LoserCheems in #37
- Adds column stride support to offset calculations by @LoserCheems in #38
- Fixes attention benchmarking and expands test coverage by @LoserCheems in #39
- Reorders stride parameter assignments by @LoserCheems in #40
- Adds column stride support to tensor memory layouts by @LoserCheems in #41
- Improves code clarity and test coverage by @LoserCheems in #42
- Updates copy function defaults and clarifies comments by @LoserCheems in #43
- Updates copy operations to use improved vectorization by @LoserCheems in #44
- Updates benchmark test configurations for better coverage by @LoserCheems in #45
- Temporarily disables Split-KV feature by @LoserCheems in #46
- Optimizes CUDA kernel block sizes for better occupancy by @LoserCheems in #49
- Enables test case for 512x512 input dimensions by @LoserCheems in #50
- Renames dzero_hold to dzoh and adds column stride by @LoserCheems in #51
- Improves code formatting consistency in comments by @LoserCheems in #52
- Fixes tensor addressing for ZOH and active mask in splitkv by @LoserCheems in #53
- Refactor attention mask and bias structures for clarity by @LoserCheems in #54
- Refactor backward kernel for attention mask and bias support by @LoserCheems in #55
- Adds Flash Attention implementation with dynamic masking by @LoserCheems in #56
- Fixes mask validation in forward kernel by @LoserCheems in #57
- Fixes mask comparison and scaling logic in attention kernel by @LoserCheems in #58
- Enhance Flash Attention with required parameters and improved backward pass by @LoserCheems in #59
- Reorganizes flash attention files into instantiations directory by @LoserCheems in #60
- Rename flash_dma to flash_dmattn and improve usability by @LoserCheems in #61
- Adds bias gradient computation to backward kernel by @LoserCheems in #62
- Add backend selection and dynamic mask attention support by @LoserCheems in #63
- Update by @LoserCheems in #64
- Removes no-topk CUDA implementation from benchmarks by @LoserCheems in #65
- Enables comprehensive benchmark configurations by @LoserCheems in #66
- Renames Flash Attention to SDPA in benchmark suite by @LoserCheems in #67
- Refactors variable declarations for better readability by @LoserCheems in #68
- Add bias gradient computation support in backward kernel by @LoserCheems in #69
- Fix function naming and standardize memory copy alignment in attention kernel by @LoserCheems in #70
- Adds unified mask application function with causal support by @LoserCheems in #71
- Enables Split-KV avoidance and updates error messages by @LoserCheems in #74
- Adds variable length forward pass support by @LoserCheems in #75
- Simplify attention mask and bias parameter naming by @LoserCheems in #76
- Remove unused parameters and simplify mask logic by @LoserCheems in #77
- Add CUDA-integrated flash attention interface by @LoserCheems in #78
- Improves version comparison using packaging library by @LoserCheems in #79
- Refactor CUDA interface for improved usability by @LoserCheems in #80
- Refactors CUDA implementation to use new interface by @LoserCheems in #81
- Refactors dynamic mask function to improve clarity by @LoserCheems in #82
- Expands API documentation with comprehensive interface guide by @LoserCheems in #83
- Refactors API to use unified flash attention interface by @LoserCheems in #84
- Convert banner image from JPG to PNG format by @LoserCheems in #85
- Updates flash attention banner image by @LoserCheems in #86
- Updates citation format and adds acknowledgment by @LoserCheems in #87
- Adds comprehensive documentation with logo and Chinese translation by @LoserCheems in #89
- Simplify and standardize flex attention interface by @LoserCheems in #90
- Standardize parameter naming and improve API consistency in attention functions by @LoserCheems in #91
- Implement Doge model with dynamic attention mechanism by @LoserCheems in #92
- Remove dropout functionality from flash attention by @LoserCheems in #93
- Make attention parameters optional with defaults and simplify API documentation by @LoserCheems in #95
- Improves API parameter naming consistency by @LoserCheems in #96
- Adds GitHub issue templates and PR template by @LoserCheems in #97
- Streamlines benchmark suite structure and test scope by @LoserCheems in #99
- Update contributor list, add citation metadata, and enhance documentation by @LoserCheems in #100
- Remove paper citation and author information from README files by @LoserCheems in #102
New Contributors
- @LoserCheems made their first contribution in #1
- @Copilot made their first contribution in #12
- @Evanwu1125 made their first contribution in #17
Full Changelog: https://github.com/SmallDoges/flash-dmattn/commits/0.1.0