Releases · DefTruth/Awesome-LLM-Inference
v2.6.11
What's Changed
- Add MiniMax-01 in Trending LLM/VLM Topics and Long Context Attention by @shaoyuyoung in #112
- [feat] add deepseek-r1 by @shaoyuyoung in #113
- 🔥🔥[DistServe] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving by @DefTruth in #114
- 🔥🔥[KVDirect] KVDirect: Distributed Disaggregated LLM Inference by @DefTruth in #115
- 🔥🔥[DeServe] DESERVE: TOWARDS AFFORDABLE OFFLINE LLM INFERENCE VIA DECENTRALIZATION by @DefTruth in #116
- 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving by @DefTruth in #117
New Contributors
- @shaoyuyoung made their first contribution in #112
Full Changelog: v2.6.10...v2.6.11
v2.6.10
What's Changed
- 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report by @DefTruth in #109
- 🔥🔥[SP: TokenRing] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication by @DefTruth in #110
- 🔥🔥[FFPA] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA(@DefTruth) by @DefTruth in #111
Full Changelog: v2.6.9...v2.6.10
v2.6.9
What's Changed
- 🔥🔥[TurboAttention] TURBOATTENTION: EFFICIENT ATTENTION APPROXIMATION FOR HIGH THROUGHPUTS LLMS by @DefTruth in #105
- 🔥🔥[NITRO] NITRO: LLM INFERENCE ON INTEL® LAPTOP NPUS by @DefTruth in #106
- 🔥[DynamicKV] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs by @DefTruth in #107
- 🔥🔥[HADACORE] HADACORE: TENSOR CORE ACCELERATED HADAMARD TRANSFORM KERNEL by @DefTruth in #108
Full Changelog: v2.6.8...v2.6.9
v2.6.8
What's Changed
- 🔥[ClusterKV] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression by @DefTruth in #103
- 🔥[BatchLLM] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching by @DefTruth in #104
Full Changelog: v2.6.7...v2.6.8
v2.6.7
v2.6.6
What's Changed
- Add code link to BPT by @DefTruth in #95
- add vAttention code link by @KevinZeng08 in #96
- 🔥[SageAttention] SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION(@thu-ml) by @DefTruth in #97
- 🔥[SageAttention-2] SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration(@thu-ml) by @DefTruth in #98
- 🔥[Squeezed Attention] SQUEEZED ATTENTION: Accelerating Long Context Length LLM Inference(@UC Berkeley) by @DefTruth in #99
- 🔥[SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference by @DefTruth in #100
New Contributors
- @KevinZeng08 made their first contribution in #96
Full Changelog: v2.6.5...v2.6.6
v2.6.5
v2.6.4
v2.6.3
v2.6.2
What's Changed
- Early exit of LLM inference by @boyi-liu in #85
- Add paper AdaKV by @FFY0 in #86
- Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance by @aharshms in #87
- 🔥[FastAttention] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference by @DefTruth in #88
New Contributors
- @boyi-liu made their first contribution in #85
- @FFY0 made their first contribution in #86
- @aharshms made their first contribution in #87
Full Changelog: v2.6.1...v2.6.2