Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper
Updated Aug 15, 2025 - Python
Trainable fast and memory-efficient sparse attention
[ICML2025, NeurIPS2025 Spotlight] Sparse VideoGen 1 & 2: Accelerating Video Diffusion Transformers with Sparse Attention
[NeurIPS 2025] Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation
SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse–Linear Attention
[ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Efficient Triton implementation of Native Sparse Attention.
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
[CoLM'25] The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression"
[ICLR 2026] SparseD: Sparse Attention for Diffusion Language Models
Advancing the frontier of efficient AI
Vortex: A Flexible and Efficient Sparse Attention Framework
[TIP-2025] Official PyTorch implementation of "Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution"
Official repository for "SSA: Sparse Sparse Attention by Aligning Full and Sparse Attention Outputs in Feature Space"
Demo code for CVPR2023 paper "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers"
Memory-bounded compressed sparse attention via streaming top-k. Triton kernels for the DeepSeek-V4 lightning indexer. 32x regime extension on a single H200 | by RightNow https://www.rightnowai.co/
Dynamic Attention Mask (DAM) generates adaptive sparse attention masks per layer and head for Transformer models, enabling long-context inference with lower compute and memory overhead without fine-tuning.
The code implementation of paper "VORTA: Efficient Video Diffusion via Routing Sparse Attention"
From-scratch reimplementation of DeepSeek's Native Sparse Attention (arXiv:2502.11089) in Triton + CUDA Hopper WGMMA. 7.4x faster than FlashAttention-3 at 64k context. Five-model training fleet, perplexity sweep, LongBench v2, MoBA comparison.
From-scratch, paper-faithful PyTorch implementation of DeepSeek-V4 architecture for transparent study, testing, ablation, and mini-scale training.