🎉 Flash-DMA v0.1.0
We're excited to announce the first official release of Flash-DMA (Flash Dynamic Mask Attention)!
🚀 What is Flash-DMA?
Flash-DMA is a high-performance attention implementation that combines:
- Flash Attention's memory efficiency
- Dynamic Mask Attention's sparse computation
- Support for extremely long sequences (128K+ tokens)
✨ Key Features
🔥 Performance
- Sparse Attention: Reduces computation from O(N²) to O(N·w) where w ≪ N
- Memory Efficient: Maintains O(N) memory complexity
- CUDA Accelerated: Custom sparse GEMM operations at kernel level
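To make the O(N·w) claim concrete, here is a minimal plain-PyTorch sketch of attending to only w dynamically selected keys per query. It illustrates the idea only, not the library's fused CUDA kernel; the `importance` scores and the function name are placeholders.

```python
# Conceptual sketch only (plain PyTorch, not the fused CUDA kernel):
# each query attends to a fixed budget of w keys chosen by a dynamic mask,
# so the score matrix shrinks from N x N to N x w.
import torch

def sparse_attention_reference(q, k, v, importance, w):
    """q, k, v: [N, d]; importance: [N, N] relevance scores; w: keys kept per query."""
    d = q.shape[-1]
    # Dynamic mask: keep the top-w keys for every query.
    topk_idx = importance.topk(w, dim=-1).indices            # [N, w]
    k_sel = k[topk_idx]                                       # [N, w, d]
    v_sel = v[topk_idx]                                       # [N, w, d]
    # Scores are computed only against the selected keys: O(N * w * d).
    scores = torch.einsum("nd,nwd->nw", q, k_sel) / d ** 0.5  # [N, w]
    probs = scores.softmax(dim=-1)
    return torch.einsum("nw,nwd->nd", probs, v_sel)           # [N, d]
```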
🛠️ Multiple Backends
- CUDA Backend: Maximum performance with custom kernels
- Triton Backend: Flexibility for research and development
- Flex Backend: Integration with Transformers library
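The backends listed above sit behind a single entry point. The following is only a hypothetical sketch of such a fallback chain, not the package's actual selection logic; the function and backend names are assumptions.

```python
# Hypothetical dispatch sketch: the real API for choosing a backend may differ.
import torch

def pick_backend(prefer: str = "cuda") -> str:
    available = []
    if torch.cuda.is_available():
        available.append("cuda")    # custom CUDA kernels, fastest path
    try:
        import triton  # noqa: F401
        available.append("triton")  # research-friendly Triton kernels
    except ImportError:
        pass
    available.append("flex")        # FlexAttention integration for Transformers
    return prefer if prefer in available else available[0]
```

The intent mirrors the feature list: prefer the CUDA kernels when a compatible GPU is present, with Triton and Flex as progressively more portable fallbacks.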
📏 Long Sequence Support
- Efficiently handles sequences of 128K+ tokens
- Dynamic masking when sequence length exceeds `keep_window_size` (sketched below)
- Optimized memory layouts for large-scale processing
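A minimal sketch of that masking rule, assuming `keep_window_size` behaves as described in the list above (plain PyTorch, illustrative only; the attention-bias input and helper name are placeholders):

```python
# Sketch of the dynamic-masking rule: dense attention for short sequences,
# top-k selection per query once the key length exceeds keep_window_size.
import torch

def dynamic_mask(attn_bias: torch.Tensor, keep_window_size: int) -> torch.Tensor:
    """attn_bias: [..., N_q, N_k] relevance scores. Returns a boolean keep-mask."""
    n_k = attn_bias.shape[-1]
    if n_k <= keep_window_size:
        # Short sequences: no sparsification, attend to every key.
        return torch.ones_like(attn_bias, dtype=torch.bool)
    # Long sequences: each query keeps only its keep_window_size highest-scoring keys.
    topk = attn_bias.topk(keep_window_size, dim=-1).indices
    mask = torch.zeros_like(attn_bias, dtype=torch.bool)
    return mask.scatter(-1, topk, True)
```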
📦 Installation
Prerequisites
- Python 3.9+
- PyTorch 2.0+
- CUDA 11.8+
- NVIDIA GPU with Compute Capability 8.0+
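A quick way to check these prerequisites from Python, using only standard PyTorch calls:

```python
# Verify the environment against the prerequisites above.
import sys
import torch

assert sys.version_info >= (3, 9), "Python 3.9+ required"
assert tuple(int(x) for x in torch.__version__.split(".")[:2]) >= (2, 0), "PyTorch 2.0+ required"
assert torch.cuda.is_available(), "a CUDA-capable NVIDIA GPU is required"
print("CUDA toolkit used by PyTorch:", torch.version.cuda)            # should be 11.8 or newer
print("GPU compute capability:", torch.cuda.get_device_capability())  # should be (8, 0) or higher
```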
Install from Source
git clone https://github.com/SmallDoges/flash-dmattn.git
cd flash-dmattn
git submodule update --init --recursive
pip install .
What's Changed
- Workspace by @LoserCheems in #1
- Add namespace_config to csrc by @LoserCheems in #2
- Add hardware_info to csrc by @LoserCheems in #3
- Add block_info to csrc by @LoserCheems in #4
- Add flash params to csrc by @LoserCheems in #5
- Workspace by @LoserCheems in #6
- Update global to Shared Memory operation by @LoserCheems in #7
- Update PREDICATES by @LoserCheems in #8
- Fix some nits for layout by @LoserCheems in #9
- Workspace by @LoserCheems in #10
- Fix Dynamic Mask Attention Integration in FlashAttention CUDA Kernel by @Copilot in #12
- Fix dynamic mask attention equivalence issue between Python and CUDA by @Copilot in #14
- Fix CUDA dynamic mask attention scaling to match Python implementation by @Copilot in #16
- Update mask.h by @Evanwu1125 in #17
- Comprehensive README improvement with installation, usage examples, and documentation by @Copilot in #19
- Adds no-topk variant for kernel performance analysis by @LoserCheems in #20
- Optimize sparse GEMM and enable in attention computation by @LoserCheems in #21
- Adds row stride support to offset calculation methods by @LoserCheems in #22
- Corrects ZOH tensor dimension comment by @LoserCheems in #23
- Adds rotary positional encoding operations by @LoserCheems in #24
- Adds conditional softcap switch macro by @LoserCheems in #25
- Removes unused template parameter from DynamicMask by @LoserCheems in #26
- Updates tensor offset calculations and formatting by @LoserCheems in #27
- Adds split-K attention kernel with sparse computation by @LoserCheems in #28
- Adds dropout support and softcap feature to flash attention by @LoserCheems in #29
- Add specialized CUDA kernels for multi-head attention with various head dimensions by @LoserCheems in #30
- Remove cub submodule and add cutlass; implement FlashDynamicMaskAttention by @LoserCheems in #31
- Refactors setup.py for production-ready package build by @LoserCheems in #32
- Update integration by @LoserCheems in #33
- Adds comprehensive API reference documentation by @LoserCheems in #34
- Updates README with improved technical accuracy and examples by @LoserCheems in #35
- Fix bug by @LoserCheems in #36
- Adds column stride parameters to ZOH_params struct by @LoserCheems in #37
- Adds column stride support to offset calculations by @LoserCheems in #38
- Fixes attention benchmarking and expands test coverage by @LoserCheems in #39
- Reorders stride parameter assignments by @LoserCheems in #40
- Adds column stride support to tensor memory layouts by @LoserCheems in #41
- Improves code clarity and test coverage by @LoserCheems in #42
- Updates copy function defaults and clarifies comments by @LoserCheems in #43
- Updates copy operations to use improved vectorization by @LoserCheems in #44
- Updates benchmark test configurations for better coverage by @LoserCheems in #45
- Temporarily disables Split-KV feature by @LoserCheems in #46
- Optimizes CUDA kernel block sizes for better occupancy by @LoserCheems in #49
- Enables test case for 512x512 input dimensions by @LoserCheems in #50
- Renames dzero_hold to dzoh and adds column stride by @LoserCheems in #51
- Improves code formatting consistency in comments by @LoserCheems in #52
- Fixes tensor addressing for ZOH and active mask in splitkv by @LoserCheems in #53
- Refactor attention mask and bias structures for clarity by @LoserCheems in #54
- Refactor backward kernel for attention mask and bias support by @LoserCheems in #55
- Adds Flash Attention implementation with dynamic masking by @LoserCheems in #56
- Fixes mask validation in forward kernel by @LoserCheems in #57
- Fixes mask comparison and scaling logic in attention kernel by @LoserCheems in #58
- Enhance Flash Attention with required parameters and improved backward pass by @LoserCheems in #59
- Reorganizes flash attention files into instantiations directory by @LoserCheems in #60
- Rename flash_dma to flash_dmattn and improve usability by @LoserCheems in #61
- Adds bias gradient computation to backward kernel by @LoserCheems in #62
- Add backend selection and dynamic mask attention support by @LoserCheems in #63
- Update by @LoserCheems in #64
- Removes no-topk CUDA implementation from benchmarks by @LoserCheems in #65
- Enables comprehensive benchmark configurations by @LoserCheems in #66
- Renames Flash Attention to SDPA in benchmark suite by @LoserCheems in #67
- Refactors variable declarations for better readability by @LoserCheems in #68
- Add bias gradient computation support in backward kernel by @LoserCheems in #69
- Fix function naming and standardize memory copy alignment in attention kernel by @LoserCheems in #70
- Adds unified mask application function with causal support by @LoserCheems in #71
- Enables Split-KV avoidance and updates error messages by @LoserCheems in #74
- Adds variable length forward pass support by @LoserCheems in #75
- Simplify attention mask and bias parameter naming by @LoserCheems in #76
- Remove unused parameters and simplify mask logic by @LoserCheems in #77
- Add CUDA-integrated flash attention interface by @LoserCheems in #78
- Improves version comparison using packaging library by @LoserCheems in #79
- Refactor CUDA interface for improved usability by @LoserCheems in #80
- Refactors CUDA implementation to use new interface by @LoserCheems in #81
- Refactors dynamic mask function to improve clarity by @LoserCheems in #82
- Expands API documentation with comprehensive interface guide by @LoserCheems in #83
- Refactors API to use unified flash attention interface by @LoserCheems in #84
- Convert banner image from JPG to PNG format by @LoserCheems in #85
- Updates flash attention banner image by @LoserCheems in #86
- Updates citation format and adds acknowledgment by @LoserCheems in #87
- Adds comprehensive documentation with logo and Chinese translation by @LoserCheems in #89
- Simplify and standardize flex attention interface by @LoserCheems in #90
- Standardize parameter naming and improve API consistency in attention functions by @LoserCheems in #91
- Implement Doge model with dynamic attention mechanism by @LoserCheems in #92
- Remove dropout functionality from flash attention by @LoserCheems in #93
- Make attention parameters optional with defaults and simplify API documentation by @LoserCheems in #95
- Improves API parameter naming consistency by @LoserCheems in #96
- Adds GitHub issue templates and PR template by @LoserCheems in #97
- Streamlines benchmark suite structure and test scope by @LoserCheems in #99
- Update contributor list, add citation metadata, and enhance documentation by @LoserCheems in #100
- Remove paper citation and author information from README files by @LoserCheems in #102
New Contributors
- @LoserCheems made their first contribution in #1
- @Copilot made their first contribution in #12
- @Evanwu1125 made their first contribution in #17
Full Changelog: https://github.com/SmallDoges/flash-dmattn/commits/0.1.0