Releases: DefTruth/Awesome-LLM-Inference

v2.6.11

31 Jan 06:54
d7914c0

What's Changed

  • Add MiniMax-01 to Trending LLM/VLM Topics and Long Context Attention by @shaoyuyoung in #112
  • [feat] Add DeepSeek-R1 by @shaoyuyoung in #113
  • 🔥🔥[DistServe] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving by @DefTruth in #114 (see the sketch after this list)
  • 🔥🔥[KVDirect] KVDirect: Distributed Disaggregated LLM Inference by @DefTruth in #115
  • 🔥🔥[DeServe] DeServe: Towards Affordable Offline LLM Inference via Decentralization by @DefTruth in #116
  • 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving by @DefTruth in #117
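
Several of this release's entries (DistServe, KVDirect, Mooncake) center on prefill/decode disaggregation. Below is a minimal Python sketch of that core idea, assuming a toy scheduler: the compute-bound prefill phase and the memory-bound decode phase run on separate worker pools, with the KV cache handed off between them. All class and function names are illustrative stand-ins, not any paper's actual API.

```python
# Minimal sketch of prefill/decode disaggregation. Request, PrefillWorker and
# DecodeWorker are illustrative names, not DistServe's or Mooncake's API.
from dataclasses import dataclass, field
from collections import deque

@dataclass
class Request:
    prompt: list                                      # token ids
    kv_cache: list = field(default_factory=list)      # produced by prefill
    generated: list = field(default_factory=list)     # produced by decode

class PrefillWorker:
    """Compute-bound phase: process the whole prompt once, emit the KV cache."""
    def run(self, req: Request) -> Request:
        req.kv_cache = [("kv", t) for t in req.prompt]
        return req

class DecodeWorker:
    """Memory-bound phase: one token per step, reusing the migrated KV cache."""
    def step(self, req: Request) -> None:
        tok = len(req.kv_cache)                       # stand-in "next token"
        req.generated.append(tok)
        req.kv_cache.append(("kv", tok))

# Disaggregation: prefill and decode run on separate pools, so a long prompt
# never stalls the token-by-token decode loop of other requests.
prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
pending = deque([Request(prompt=[1, 2, 3]), Request(prompt=[4, 5])])
decoding = deque()

while pending or decoding:
    if pending:                                       # admit one request per tick
        decoding.append(prefill_pool.run(pending.popleft()))
    for req in list(decoding):
        decode_pool.step(req)
        if len(req.generated) >= 4:                   # toy stopping criterion
            decoding.remove(req)
            print("done:", req.generated)
```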

New Contributors

  • @shaoyuyoung made their first contribution in #112

Full Changelog: v2.6.10...v2.6.11

v2.6.10

06 Jan 06:12
b8b3a43

What's Changed

  • 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report by @DefTruth in #109
  • 🔥🔥[SP: TokenRing] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication by @DefTruth in #110
  • 🔥🔥[FFPA] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA (@DefTruth) by @DefTruth in #111 (see the sketch after this list)
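
FFPA's O(1) SRAM claim rests on the streaming (online) softmax used by flash-attention-style prefill kernels. Below is a NumPy sketch of that recurrence for a single query; it illustrates the memory pattern only and is the textbook formulation, not FFPA's CUDA kernel.

```python
# Streaming (online) softmax: K/V are consumed in tiles with O(1) extra state
# per query, never materializing the full score row.
import numpy as np

def streaming_attention(q, K, V, tile=4):
    """q: (d,), K/V: (n, d) -> softmax(q K^T / sqrt(d)) V, tile by tile."""
    d = q.shape[0]
    m = -np.inf                                  # running max (stability)
    l = 0.0                                      # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])                   # running weighted sum of V rows
    for s0 in range(0, K.shape[0], tile):
        s = K[s0:s0 + tile] @ q / np.sqrt(d)     # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                # rescale old statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s0:s0 + tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(16, 8)), rng.normal(size=(16, 8)), rng.normal(size=8)
s = K @ q / np.sqrt(8)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(streaming_attention(q, K, V), ref)   # matches full softmax
```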

Full Changelog: v2.6.9...v2.6.10

v2.6.9

22 Dec 08:04
6ad7b30

What's Changed

  • 🔥🔥[TurboAttention] TurboAttention: Efficient Attention Approximation for High-Throughput LLMs by @DefTruth in #105
  • 🔥🔥[NITRO] NITRO: LLM Inference on Intel® Laptop NPUs by @DefTruth in #106
  • 🔥[DynamicKV] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs by @DefTruth in #107
  • 🔥🔥[HadaCore] HadaCore: Tensor Core Accelerated Hadamard Transform Kernel by @DefTruth in #108 (see the sketch after this list)
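
For context on the HadaCore entry: the fast Walsh-Hadamard transform replaces an n x n matrix multiply with O(n log n) butterflies, and Hadamard rotations are commonly used to tame activation outliers before quantization. A pure-Python reference sketch, for illustration rather than speed (making this fast on tensor cores is exactly HadaCore's contribution):

```python
# Reference fast Walsh-Hadamard transform: O(n log n) 2-point butterflies.
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = np.asarray(x, dtype=np.float64).copy()
    h = 1
    while h < len(x):
        for i in range(0, len(x), 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b    # 2-point butterfly
        h *= 2
    return x

# Agrees with the explicit (Sylvester-ordered) Hadamard matrix product H @ x.
n = 8
H = np.array([[(-1) ** bin(i & j).count("1") for j in range(n)] for i in range(n)])
v = np.arange(n, dtype=float)
assert np.allclose(fwht(v), H @ v)
```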

Full Changelog: v2.6.8...v2.6.9

v2.6.8

09 Dec 01:22
32fdb84

What's Changed

  • 🔥[ClusterKV] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression by @DefTruth in #103
  • 🔥[BatchLLM] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching by @DefTruth in #104 (see the sketch after this list)
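
A toy sketch of the global prefix sharing named in the BatchLLM entry: requests that share a prompt prefix reuse a single prefill of that prefix rather than recomputing its KV cache per request. The helper names and cache layout here are assumptions for illustration, not the paper's implementation.

```python
# Group requests by identical prompt prefix; prefill each prefix once.
from collections import defaultdict

def prefill(tokens):
    """Stand-in for the expensive prefill pass; returns a fake KV cache."""
    print(f"prefill: {len(tokens)} tokens")
    return [("kv", t) for t in tokens]

def batch_with_prefix_sharing(requests):
    """requests: list of (shared_prefix, suffix) token-id lists."""
    groups = defaultdict(list)
    for prefix, suffix in requests:
        groups[tuple(prefix)].append(suffix)     # group by identical prefix
    results = []
    for prefix, suffixes in groups.items():
        shared_kv = prefill(list(prefix))        # computed once per group
        for suffix in suffixes:
            results.append(shared_kv + prefill(suffix))  # only suffix recomputed
    return results

sys_prompt = list(range(100))                    # e.g. a long shared system prompt
batch_with_prefix_sharing([(sys_prompt, [1]), (sys_prompt, [2]), (sys_prompt, [3])])
# -> one 100-token prefill plus three 1-token suffix prefills
```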

Full Changelog: v2.6.7...v2.6.8

v2.6.7

02 Dec 05:30
9f548f6

What's Changed

  • 🔥[Star Attention: ~11x speedup] Star Attention: Efficient LLM Inference over Long Sequences by @DefTruth in #101 (see the sketch after this list)
  • 🔥[KV Cache Recomputation] Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation by @DefTruth in #102
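
Star Attention's query phase combines per-host partial attention over context shards into an exact global result using log-sum-exp statistics. The NumPy sketch below shows that merge rule in isolation; the paper's sharding, anchor blocks, and communication are all omitted, so treat this as the underlying math only.

```python
# Exact merge of per-shard partial attention via log-sum-exp statistics.
import numpy as np

def shard_attention(q, K, V):
    """One host's partial result over its context shard, plus LSE statistics."""
    s = K @ q / np.sqrt(q.shape[0])
    m = s.max()
    p = np.exp(s - m)
    return p @ V, p.sum(), m                      # (partial output, sum, max)

def merge(parts):
    """Combine per-shard partials into the global attention output, exactly."""
    m_g = max(m for _, _, m in parts)             # global max for stability
    num = sum(o * np.exp(m - m_g) for o, _, m in parts)
    den = sum(l * np.exp(m - m_g) for _, l, m in parts)
    return num / den

rng = np.random.default_rng(1)
K, V, q = rng.normal(size=(32, 8)), rng.normal(size=(32, 8)), rng.normal(size=8)
parts = [shard_attention(q, K[i:i + 8], V[i:i + 8]) for i in range(0, 32, 8)]
s = K @ q / np.sqrt(8)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(merge(parts), ref)             # matches full attention
```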

Full Changelog: v2.6.6...v2.6.7

v2.6.6

25 Nov 03:22
40292d7

What's Changed

  • Add code link to BPT by @DefTruth in #95
  • Add vAttention code link by @KevinZeng08 in #96
  • 🔥[SageAttention] SageAttention: Accurate 8-Bit Attention for Plug-and-Play Inference Acceleration (@thu-ml) by @DefTruth in #97 (see the sketch after this list)
  • 🔥[SageAttention-2] SageAttention2 Technical Report: Accurate 4-Bit Attention for Plug-and-Play Inference Acceleration (@thu-ml) by @DefTruth in #98
  • 🔥[Squeezed Attention] Squeezed Attention: Accelerating Long Context Length LLM Inference (@UC Berkeley) by @DefTruth in #99
  • 🔥[SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference by @DefTruth in #100
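
A NumPy sketch of the core recipe behind the SageAttention entries above: quantize Q and K to INT8 with per-tensor scales, accumulate Q·K^T in INT32 as an integer tensor-core kernel would, then dequantize before the softmax while keeping P·V in floating point. The paper's K-smoothing and per-block scales are omitted, so this is the idea, not the kernel.

```python
# INT8 Q/K quantization with integer score accumulation, fp softmax and PV.
import numpy as np

def int8_quant(x):
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def sage_like_attention(Q, K, V):
    q8, sq = int8_quant(Q)
    k8, sk = int8_quant(K)
    # INT8 x INT8 accumulated in INT32, as a tensor-core kernel would do
    scores_i32 = q8.astype(np.int32) @ k8.astype(np.int32).T
    scores = scores_i32 * (sq * sk) / np.sqrt(Q.shape[-1])   # dequantize
    scores -= scores.max(axis=-1, keepdims=True)
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                             # keep PV in fp

rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(16, 64)) for _ in range(3))
out = sage_like_attention(Q, K, V)
ref_s = Q @ K.T / np.sqrt(64)
ref = np.exp(ref_s - ref_s.max(-1, keepdims=True))
ref /= ref.sum(-1, keepdims=True)
print("max abs error vs fp attention:", np.abs(out - ref @ V).max())
```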

New Contributors

  • @KevinZeng08 made their first contribution in #96

Full Changelog: v2.6.5...v2.6.6

v2.6.5

18 Nov 02:53
06c76ad

What's Changed

  • Add DP/TP/SP/CP papers with codes by @DefTruth in #92
  • 🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models by @DefTruth in #93 (see the sketch after this list)
  • 🔥🔥[TP: Comm Compression] Communication Compression for Tensor Parallel LLM Inference by @DefTruth in #94
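
The BPT entry is about bounding activation memory by processing the sequence in blocks. The sketch below shows the idea on the feedforward sublayer only, where chunking the token dimension changes peak memory but not the result; the paper applies the same blocking to attention as well, which this simplification omits.

```python
# Blockwise feedforward: identical output, peak activation memory bounded by
# the block size instead of the full sequence length.
import numpy as np

def ffn(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2            # standard ReLU MLP

def blockwise_ffn(x, W1, W2, block=128):
    out = np.empty_like(x)
    for s in range(0, x.shape[0], block):          # one chunk of tokens at a time
        out[s:s + block] = ffn(x[s:s + block], W1, W2)
    return out

rng = np.random.default_rng(3)
x = rng.normal(size=(1024, 64))                    # (seq_len, d_model)
W1, W2 = rng.normal(size=(64, 256)), rng.normal(size=(256, 64))
assert np.allclose(blockwise_ffn(x, W1, W2), ffn(x, W1, W2))
```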

Full Changelog: v2.6.4...v2.6.5

v2.6.4

13 Nov 07:02
f3f27a7

What's Changed

  • 🔥[BitNet] BitNet a4.8: 4-bit Activations for 1-bit LLMs by @DefTruth in #91 (see the sketch below)
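
A toy sketch of the quantization regime named in the BitNet a4.8 entry: extremely low-bit weights (ternary here, as in BitNet b1.58) combined with 4-bit activations, so the matmul can run in integer arithmetic. The paper's hybrid quantization and sparsification of activation outliers are omitted; this only illustrates the bit widths involved.

```python
# Ternary weights x INT4 activations with per-tensor scales.
import numpy as np

def quant_weights_ternary(W):
    scale = np.abs(W).mean()
    return np.clip(np.round(W / scale), -1, 1).astype(np.int8), scale

def quant_act_int4(x):
    scale = np.abs(x).max() / 7.0                  # signed 4-bit range: [-7, 7]
    return np.clip(np.round(x / scale), -7, 7).astype(np.int8), scale

rng = np.random.default_rng(4)
W = rng.normal(size=(64, 64))
x = rng.normal(size=(1, 64))
Wq, sw = quant_weights_ternary(W)
xq, sx = quant_act_int4(x)
y = (xq.astype(np.int32) @ Wq.astype(np.int32).T) * (sx * sw)  # integer matmul
print("relative error:", np.linalg.norm(y - x @ W.T) / np.linalg.norm(x @ W.T))
```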

Full Changelog: v2.6.3...v2.6.4

v2.6.3

01 Nov 01:18
a854d6c

What's Changed

  • 🔥[Fast Best-of-N] Fast Best-of-N Decoding via Speculative Rejection by @DefTruth in #89 (see the sketch after this list)
  • 🔥[Tensor Product] Acceleration of Tensor-Product Operations with Tensor Cores by @DefTruth in #90
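
A toy sketch of speculative rejection for fast best-of-N decoding: all N candidates decode together, and at checkpoints the partial completions scoring lowest under a reward model are dropped early, concentrating compute on promising ones. The generator and reward model below are random stand-ins, not the paper's models.

```python
# Best-of-N with early rejection at periodic reward checkpoints.
import random
random.seed(0)

def extend(candidate):                 # stand-in for one decoding step
    candidate.append(random.random())
    return candidate

def partial_reward(candidate):         # stand-in reward model on a prefix
    return sum(candidate) / len(candidate)

N, max_len, checkpoint = 16, 32, 8
candidates = [[] for _ in range(N)]
for step in range(1, max_len + 1):
    candidates = [extend(c) for c in candidates]
    if step % checkpoint == 0 and len(candidates) > 1:
        candidates.sort(key=partial_reward, reverse=True)
        candidates = candidates[: max(1, len(candidates) // 2)]  # reject bottom half

best = max(candidates, key=partial_reward)
print(f"{len(candidates)} survivor(s); best reward = {partial_reward(best):.3f}")
```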

Full Changelog: v2.6.2...v2.6.3

v2.6.2

28 Oct 02:38
613300d

What's Changed

  • Early exit of LLM inference by @boyi-liu in #85 (see the sketch after this list)
  • Add the AdaKV paper by @FFY0 in #86
  • Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance by @aharshms in #87
  • 🔥[FastAttention] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference by @DefTruth in #88
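
A toy sketch of early-exit inference as in the first entry above: a lightweight exit head checks the confidence of the intermediate prediction after each layer and skips the remaining layers once a threshold is crossed. All shapes, names, and the tanh "layer" are illustrative assumptions, not any particular paper's architecture.

```python
# Per-layer confidence check; stop running layers once the threshold is met.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(5)
n_layers, d, vocab = 12, 64, 100
layers = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(n_layers)]
exit_head = rng.normal(size=(d, vocab)) / np.sqrt(d)

def forward_with_early_exit(h, threshold=0.5):
    for i, W in enumerate(layers):
        h = np.tanh(h @ W)                      # stand-in transformer layer
        probs = softmax(h @ exit_head)          # cheap intermediate prediction
        if probs.max() >= threshold:            # confident enough -> exit
            return probs.argmax(), i + 1
    return probs.argmax(), n_layers

token, used = forward_with_early_exit(rng.normal(size=d), threshold=0.08)
print(f"predicted token {token} using {used}/{n_layers} layers")
```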

New Contributors

  • @boyi-liu made their first contribution in #85
  • @FFY0 made their first contribution in #86
  • @aharshms made their first contribution in #87

Full Changelog: v2.6.1...v2.6.2