[Feature request] Sparse Attention #367

@Ageliss

Description

Recently we have seen several awesome works on KV cache compression that report 1.7–2.3× speedups over FlashInfer. Could you please consider supporting such features?

Same layer KV:
Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference (the query-aware page-selection idea is sketched after this list)

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression

Cross layer KV:
Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
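
For context, the core mechanism in Quest is cheap per-page metadata: each KV page keeps the elementwise min/max of its keys, and at decode time an upper bound on the attention logit is computed per page so that only the top-k most promising pages are passed to the attention kernel. Below is a minimal NumPy sketch of that selection step; the function names and shapes are illustrative only, not FlashInfer's API:

```python
import numpy as np

def quest_page_upper_bounds(q, key_min, key_max):
    """Per-page upper bound on the attention logit q.k (Quest-style estimate).

    q:        (d,)         query vector
    key_min:  (n_pages, d) elementwise minimum of the keys in each page
    key_max:  (n_pages, d) elementwise maximum of the keys in each page
    """
    # For any key k in a page, q_i * k_i <= max(q_i * key_min_i, q_i * key_max_i),
    # so summing the per-dimension maxima bounds the page's largest logit.
    return np.maximum(q * key_min, q * key_max).sum(axis=-1)

def select_pages(q, key_min, key_max, top_k):
    """Return indices of the top_k pages that may contain critical tokens."""
    bounds = quest_page_upper_bounds(q, key_min, key_max)
    return np.argpartition(bounds, -top_k)[-top_k:]

# Example: 8 pages of 4-dim keys; attend over only the 2 most promising pages.
rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 16, 4))                # (pages, page_size, d)
page_min, page_max = keys.min(axis=1), keys.max(axis=1)
q = rng.normal(size=4)
pages = select_pages(q, page_min, page_max, top_k=2)
```

The attention kernel then runs only over the selected pages, which is where the reported speedups come from: the per-page bound is O(n_pages * d) while full attention over the KV cache is O(n_tokens * d).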
