From 7d5664a941c18e9341d170338c56b9f88e140388 Mon Sep 17 00:00:00 2001
From: yucheng lab
Date: Fri, 26 Apr 2024 12:01:39 +0800
Subject: [PATCH] add SnapKV

---
 README.md | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/README.md b/README.md
index 9f6d4da..9c64364 100644
--- a/README.md
+++ b/README.md
@@ -159,7 +159,8 @@ Awesome-LLM-Inference: A curated list of [📙Awesome LLM Inference Papers with
 |2023.05|🔥🔥[**GQA**] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints(@Google) | [[pdf]](https://arxiv.org/pdf/2305.13245.pdf)|[[flaxformer]](https://github.com/google/flaxformer) ![](https://img.shields.io/github/stars/google/flaxformer.svg?style=social) |⭐️⭐️ |
 |2023.05|[KV Cache Compress] Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time(@)|[[pdf]](https://arxiv.org/pdf/2305.17118.pdf)|⚠️|⭐️⭐️ |
 |2023.06|[H2O] H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models(@Rice University etc)|[[pdf]](https://arxiv.org/pdf/2306.14048.pdf)|[[H2O]](https://github.com/FMInference/H2O) ![](https://img.shields.io/github/stars/FMInference/H2O.svg?style=social) |⭐️ |
-|2023.06|[QK-Sparse/Dropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) |[[pdf]](https://arxiv.org/pdf/2306.01160.pdf) | [[dynamic-sparse-flash-attention]](https://github.com/epfml/dynamic-sparse-flash-attention) ![](https://img.shields.io/github/stars/epfml/dynamic-sparse-flash-attention.svg?style=social)|⭐️ |
+|2023.06|[QK-Sparse/Dropping Attention] Faster Causal Attention Over Large Sequences Through Sparse Flash Attention(@EPFL etc) |[[pdf]](https://arxiv.org/pdf/2306.01160.pdf) | [[dynamic-sparse-flash-attention]](https://github.com/epfml/dynamic-sparse-flash-attention) ![](https://img.shields.io/github/stars/epfml/dynamic-sparse-flash-attention.svg?style=social)|⭐️ |
+|2023.08|🔥🔥[Chunked Prefills] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills(@Microsoft etc) | [[pdf]](https://arxiv.org/pdf/2308.16369.pdf)|⚠️|⭐️⭐️ |
 |2023.09|🔥🔥[**PagedAttention**] Efficient Memory Management for Large Language Model Serving with PagedAttention(@UC Berkeley etc) |[[pdf]](https://arxiv.org/pdf/2309.06180.pdf)|[[vllm]](https://github.com/vllm-project/vllm) ![](https://img.shields.io/github/stars/vllm-project/vllm.svg?style=social)|⭐️⭐️ |
 |2023.09|[KV Cache FP8 + WINT4] Exploration on LLM inference performance optimization(@HPC4AI) | [[blog]](https://zhuanlan.zhihu.com/p/653735572)|⚠️|⭐️ |
 |2023.10|🔥[**TensorRT-LLM KV Cache FP8**] NVIDIA TensorRT LLM(@NVIDIA) |[[docs]](https://nvidia.github.io/TensorRT-LLM/precision.html)|[[TensorRT-LLM]](https://github.com/NVIDIA/TensorRT-LLM) ![](https://img.shields.io/github/stars/NVIDIA/TensorRT-LLM.svg?style=social) |⭐️⭐️ |
@@ -171,14 +172,15 @@
 |2024.02|🔥🔥[Prompt Caching] Efficient Prompt Caching via Embedding Similarity(@UC Berkeley)|[[pdf]](https://arxiv.org/pdf/2402.01173.pdf)|⚠️|⭐️⭐️ |
 |2024.02|🔥🔥[Less] Get More with LESS: Synthesizing Recurrence with KV Cache Compression for Efficient LLM Inference(@CMU etc)|[[pdf]](https://arxiv.org/pdf/2402.09398.pdf)|⚠️|⭐️ |
 |2024.02|🔥🔥[MiKV] No Token Left Behind: Reliable KV Cache Compression via Importance-Aware Mixed Precision Quantization(@KAIST)|[[pdf]](https://arxiv.org/pdf/2402.18096.pdf)|⚠️|⭐️ |
+|2024.02|🔥🔥[Shared Prefixes] Hydragen: High-Throughput LLM Inference with Shared Prefixes | [[pdf]](https://arxiv.org/pdf/2402.05099.pdf)|⚠️|⭐️⭐️ |
 |2024.03|🔥[QAQ] QAQ: Quality Adaptive Quantization for LLM KV Cache(@@smail.nju.edu.cn)|[[pdf]](https://arxiv.org/pdf/2403.04643.pdf)|[[QAQ-KVCacheQuantization]](https://github.com/ClubieDong/QAQ-KVCacheQuantization) ![](https://img.shields.io/github/stars/ClubieDong/QAQ-KVCacheQuantization.svg?style=social) |⭐️⭐️ |
 |2024.03|🔥🔥[DMC] Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference(@NVIDIA etc)|[[pdf]](https://arxiv.org/pdf/2403.09636.pdf)|⚠️|⭐️⭐️ |
 |2024.03|🔥🔥[Keyformer] Keyformer: KV Cache reduction through key tokens selection for Efficient Generative Inference(@ece.ubc.ca etc)|[[pdf]](https://arxiv.org/pdf/2403.09054.pdf)|⚠️|⭐️⭐️ |
 |2024.03|[FASTDECODE] FASTDECODE: High-Throughput GPU-Efficient LLM Serving using Heterogeneous(@Tsinghua University)|[[pdf]](https://arxiv.org/pdf/2403.11421.pdf)|⚠️|⭐️⭐️ |
 |2024.03|[Sparsity-Aware KV Caching] ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching(@ucf.edu)|[[pdf]](https://arxiv.org/pdf/2403.17312.pdf)|⚠️|⭐️⭐️ |
 |2024.04|[SqueezeAttention] SQUEEZEATTENTION: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget(@lzu.edu.cn etc)|[[pdf]](https://arxiv.org/pdf/2404.04793.pdf)|[[SqueezeAttention]](https://github.com/hetailang/SqueezeAttention) ![](https://img.shields.io/github/stars/hetailang/SqueezeAttention.svg?style=social) |⭐️⭐️ |
-|2024.02|🔥🔥[Shared Prefixes] Hydragen: High-Throughput LLM Inference with Shared Prefixes | [[pdf]](https://arxiv.org/pdf/2402.05099.pdf)|⚠️|⭐️⭐️ |
-|2023.08|🔥🔥[Chunked Prefills] SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills(@Microsoft etc) | [[pdf]](https://arxiv.org/pdf/2308.16369.pdf)|⚠️|⭐️⭐️ |
+|2024.04|[SnapKV] SnapKV: LLM Knows What You are Looking for Before Generation(@UIUC)|[[pdf]](https://arxiv.org/pdf/2404.14469)|[[SnapKV]](https://github.com/FasterDecoding/SnapKV) ![](https://img.shields.io/github/stars/FasterDecoding/SnapKV.svg?style=social)|⭐️ |
+
 
 ### 📖Prompt/Context Compression ([©️back👆🏻](#paperlist))