
🎉 Flash-DMA v0.1.0


@LoserCheems released this 10 Aug 12:38
· 541 commits to main since this release
802613e

We're excited to announce the first official release of Flash-DMA (Flash Dynamic Mask Attention)!

🚀 What is Flash-DMA?

Flash-DMA is a high-performance attention implementation that combines:

  • Flash Attention's memory efficiency
  • Dynamic Mask Attention's sparse computation
  • Support for extremely long sequences (128K+ tokens)

✨ Key Features

🔥 Performance

  • Sparse Attention: Reduces computation from O(N²) to O(N·w) where w ≪ N (see the cost sketch below)
  • Memory Efficient: Maintains O(N) memory complexity
  • CUDA Accelerated: Custom sparse GEMM operations at kernel level
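
To make the O(N²) versus O(N·w) claim concrete, here is a rough back-of-the-envelope count of attention-score elements for dense attention and for a per-query window of w kept keys. It is illustrative only and does not model the kernel's tiling or constant factors.

```python
# Back-of-the-envelope comparison of dense vs. windowed-sparse attention work.
# Illustrative only: real kernels add constants for tiling, softmax, and the PV GEMM.

def dense_score_elements(seq_len: int) -> int:
    """Dense attention touches every (query, key) pair: O(N^2)."""
    return seq_len * seq_len

def sparse_score_elements(seq_len: int, keep_window_size: int) -> int:
    """Sparse attention keeps at most keep_window_size keys per query: O(N*w)."""
    return seq_len * min(keep_window_size, seq_len)

keep_window_size = 2_048  # example window, w << N
for n in (4_096, 32_768, 131_072):  # up to 128K tokens
    dense = dense_score_elements(n)
    sparse = sparse_score_elements(n, keep_window_size)
    print(f"N={n:>7}  dense={dense:.3e}  sparse={sparse:.3e}  speedup~{dense / sparse:.0f}x")
```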

🛠️ Multiple Backends

  • CUDA Backend: Maximum performance with custom kernels (see the backend-selection sketch after this list)
  • Triton Backend: Flexibility for research and development
  • Flex Backend: Integration with Transformers library
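
The release notes do not specify a single selection API, so the sketch below only illustrates how a caller might fall back from the CUDA kernels to Triton and then to the Flex/eager path depending on what the environment provides. The function name and backend labels are assumptions for illustration, not the package's actual interface.

```python
# Hypothetical backend picker: falls back from the custom CUDA kernels to Triton,
# then to the Flex/eager path. The string labels and module check are assumptions,
# not flash-dmattn's actual API.
import importlib.util
import torch

def pick_backend() -> str:
    if torch.cuda.is_available():
        # The CUDA backend needs the compiled extension built via `pip install .`.
        if importlib.util.find_spec("flash_dmattn") is not None:  # assumed module name
            return "cuda"
        # Triton offers a portable GPU fallback for research and development.
        if importlib.util.find_spec("triton") is not None:
            return "triton"
    # The Flex path integrates with the Transformers library and runs anywhere.
    return "flex"

print("selected backend:", pick_backend())
```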

📏 Long Sequence Support

  • Efficiently handles sequences of 128K+ tokens
  • Dynamic masking is applied when the sequence length exceeds keep_window_size (see the reference sketch below)
  • Optimized memory layouts for large-scale processing
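
To make the keep_window_size behaviour concrete, the pure-PyTorch sketch below keeps only the top-k highest-bias keys per query once the key length exceeds the window and masks the rest before softmax. This is a conceptual reference for the masking idea, not the fused CUDA kernel, and the attn_bias argument is a stand-in for the ZOH states the kernels use internally.

```python
# Conceptual reference for dynamic masking (not the fused CUDA kernel).
import torch

def dynamic_mask_attention_ref(q, k, v, attn_bias, keep_window_size=2048):
    # q, k, v: [batch, heads, seq_len, head_dim]; attn_bias: [batch, heads, q_len, k_len]
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale + attn_bias
    if scores.shape[-1] > keep_window_size:
        # Keep only the keep_window_size highest-bias keys per query; mask the rest.
        topk_idx = torch.topk(attn_bias, keep_window_size, dim=-1).indices
        keep = torch.zeros_like(scores).scatter_(-1, topk_idx, 1.0)
        scores = scores.masked_fill(keep == 0, float("-inf"))
    return torch.matmul(torch.softmax(scores, dim=-1), v)

q = k = v = torch.randn(1, 2, 2048, 64)
bias = torch.randn(1, 2, 2048, 2048)
out = dynamic_mask_attention_ref(q, k, v, bias, keep_window_size=512)
print(out.shape)  # torch.Size([1, 2, 2048, 64])
```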

📦 Installation

Prerequisites

  • Python 3.9+
  • PyTorch 2.0+
  • CUDA 11.8+
  • NVIDIA GPU with Compute Capability 8.0+ (see the environment check below)
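
An optional way to confirm the environment meets these requirements before building, using only standard torch and packaging introspection:

```python
# Optional environment check before building the extension.
import sys
import torch
from packaging.version import Version

assert sys.version_info >= (3, 9), "Python 3.9+ required"
assert Version(torch.__version__.split("+")[0]) >= Version("2.0"), "PyTorch 2.0+ required"
assert torch.cuda.is_available(), "CUDA-capable GPU required"
print("CUDA runtime bundled with PyTorch:", torch.version.cuda)  # expect 11.8 or newer
major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (8, 0), "Compute Capability 8.0+ required (e.g. A100, RTX 30/40 series)"
print(f"OK: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
```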

Install from Source

```bash
git clone https://github.com/SmallDoges/flash-dmattn.git
cd flash-dmattn
git submodule update --init --recursive
pip install .
```
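
Once the build completes, a quick import is usually enough to confirm the extension compiled. The import path flash_dmattn below mirrors the repository name and is an assumption, since these notes do not spell out the exact public entry points.

```python
# Post-install smoke test; the import path `flash_dmattn` is assumed from the repo name.
import torch
import flash_dmattn  # raises ImportError if the extension did not build

print("flash_dmattn loaded from:", flash_dmattn.__file__)
print("CUDA available:", torch.cuda.is_available())
```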

What's Changed

  • Workspace by @LoserCheems in #1
  • Add namespace_config to csrc by @LoserCheems in #2
  • Add hardware_info to csrc by @LoserCheems in #3
  • Add block_info to csrc by @LoserCheems in #4
  • Add flash params to csrc by @LoserCheems in #5
  • Workspace by @LoserCheems in #6
  • Update global to Shared Memory operation by @LoserCheems in #7
  • Update PREDICATES by @LoserCheems in #8
  • Fix some nits for layout by @LoserCheems in #9
  • Workspace by @LoserCheems in #10
  • Fix Dynamic Mask Attention Integration in FlashAttention CUDA Kernel by @Copilot in #12
  • Fix dynamic mask attention equivalence issue between Python and CUDA by @Copilot in #14
  • Fix CUDA dynamic mask attention scaling to match Python implementation by @Copilot in #16
  • Update mask.h by @Evanwu1125 in #17
  • Comprehensive README improvement with installation, usage examples, and documentation by @Copilot in #19
  • Adds no-topk variant for kernel performance analysis by @LoserCheems in #20
  • Optimize sparse GEMM and enable in attention computation by @LoserCheems in #21
  • Adds row stride support to offset calculation methods by @LoserCheems in #22
  • Corrects ZOH tensor dimension comment by @LoserCheems in #23
  • Adds rotary positional encoding operations by @LoserCheems in #24
  • Adds conditional softcap switch macro by @LoserCheems in #25
  • Removes unused template parameter from DynamicMask by @LoserCheems in #26
  • Updates tensor offset calculations and formatting by @LoserCheems in #27
  • Adds split-K attention kernel with sparse computation by @LoserCheems in #28
  • Adds dropout support and softcap feature to flash attention by @LoserCheems in #29
  • Add specialized CUDA kernels for multi-head attention with various head dimensions by @LoserCheems in #30
  • Remove cub submodule and add cutlass; implement FlashDynamicMaskAttention by @LoserCheems in #31
  • Refactors setup.py for production-ready package build by @LoserCheems in #32
  • Update integration by @LoserCheems in #33
  • Adds comprehensive API reference documentation by @LoserCheems in #34
  • Updates README with improved technical accuracy and examples by @LoserCheems in #35
  • Fix bug by @LoserCheems in #36
  • Adds column stride parameters to ZOH_params struct by @LoserCheems in #37
  • Adds column stride support to offset calculations by @LoserCheems in #38
  • Fixes attention benchmarking and expands test coverage by @LoserCheems in #39
  • Reorders stride parameter assignments by @LoserCheems in #40
  • Adds column stride support to tensor memory layouts by @LoserCheems in #41
  • Improves code clarity and test coverage by @LoserCheems in #42
  • Updates copy function defaults and clarifies comments by @LoserCheems in #43
  • Updates copy operations to use improved vectorization by @LoserCheems in #44
  • Updates benchmark test configurations for better coverage by @LoserCheems in #45
  • Temporarily disables Split-KV feature by @LoserCheems in #46
  • Optimizes CUDA kernel block sizes for better occupancy by @LoserCheems in #49
  • Enables test case for 512x512 input dimensions by @LoserCheems in #50
  • Renames dzero_hold to dzoh and adds column stride by @LoserCheems in #51
  • Improves code formatting consistency in comments by @LoserCheems in #52
  • Fixes tensor addressing for ZOH and active mask in splitkv by @LoserCheems in #53
  • Refactor attention mask and bias structures for clarity by @LoserCheems in #54
  • Refactor backward kernel for attention mask and bias support by @LoserCheems in #55
  • Adds Flash Attention implementation with dynamic masking by @LoserCheems in #56
  • Fixes mask validation in forward kernel by @LoserCheems in #57
  • Fixes mask comparison and scaling logic in attention kernel by @LoserCheems in #58
  • Enhance Flash Attention with required parameters and improved backward pass by @LoserCheems in #59
  • Reorganizes flash attention files into instantiations directory by @LoserCheems in #60
  • Rename flash_dma to flash_dmattn and improve usability by @LoserCheems in #61
  • Adds bias gradient computation to backward kernel by @LoserCheems in #62
  • Add backend selection and dynamic mask attention support by @LoserCheems in #63
  • Update by @LoserCheems in #64
  • Removes no-topk CUDA implementation from benchmarks by @LoserCheems in #65
  • Enables comprehensive benchmark configurations by @LoserCheems in #66
  • Renames Flash Attention to SDPA in benchmark suite by @LoserCheems in #67
  • Refactors variable declarations for better readability by @LoserCheems in #68
  • Add bias gradient computation support in backward kernel by @LoserCheems in #69
  • Fix function naming and standardize memory copy alignment in attention kernel by @LoserCheems in #70
  • Adds unified mask application function with causal support by @LoserCheems in #71
  • Enables Split-KV avoidance and updates error messages by @LoserCheems in #74
  • Adds variable length forward pass support by @LoserCheems in #75
  • Simplify attention mask and bias parameter naming by @LoserCheems in #76
  • Remove unused parameters and simplify mask logic by @LoserCheems in #77
  • Add CUDA-integrated flash attention interface by @LoserCheems in #78
  • Improves version comparison using packaging library by @LoserCheems in #79
  • Refactor CUDA interface for improved usability by @LoserCheems in #80
  • Refactors CUDA implementation to use new interface by @LoserCheems in #81
  • Refactors dynamic mask function to improve clarity by @LoserCheems in #82
  • Expands API documentation with comprehensive interface guide by @LoserCheems in #83
  • Refactors API to use unified flash attention interface by @LoserCheems in #84
  • Convert banner image from JPG to PNG format by @LoserCheems in #85
  • Updates flash attention banner image by @LoserCheems in #86
  • Updates citation format and adds acknowledgment by @LoserCheems in #87
  • Adds comprehensive documentation with logo and Chinese translation by @LoserCheems in #89
  • Simplify and standardize flex attention interface by @LoserCheems in #90
  • Standardize parameter naming and improve API consistency in attention functions by @LoserCheems in #91
  • Implement Doge model with dynamic attention mechanism by @LoserCheems in #92
  • Remove dropout functionality from flash attention by @LoserCheems in #93
  • Make attention parameters optional with defaults and simplify API documentation by @LoserCheems in #95
  • Improves API parameter naming consistency by @LoserCheems in #96
  • Adds GitHub issue templates and PR template by @LoserCheems in #97
  • Streamlines benchmark suite structure and test scope by @LoserCheems in #99
  • Update contributor list, add citation metadata, and enhance documentation by @LoserCheems in #100
  • Remove paper citation and author information from README files by @LoserCheems in #102

New Contributors

Full Changelog: https://github.com/SmallDoges/flash-dmattn/commits/0.1.0