[Core] Support sparse KV cache framework #5752
Motivation
For current large model inference, the KV cache occupies a significant portion of GPU memory, so reducing its size is an important direction for improvement. Several recent papers approach this issue from different angles (a detailed comparison is given in the table), including:
FastDecode: this method offloads the KV cache entirely to the CPU; both the computation and the storage of the KV cache happen on the CPU.
Compression methods based on quantization (GEAR, Mixed Precision): various quantization techniques reduce the size of each token's KV cache entry without reducing the number of tokens stored in the KV cache. These methods may produce additional residual and outlier matrices, which must be kept in memory but not in the KV cache itself, or may quantize only the KV cache of unimportant tokens to shrink the overall memory footprint.
Partial KV cache eviction (H2O, SnapKV, LESS, Adaptive Compression, Scissorhands, Dynamic Memory Compression, StreamingLLM): relatively unimportant KV cache entries are removed, reducing the memory footprint of the KV cache. Essentially, this reduces the number of tokens stored in the KV cache without reducing the size of each token's entry.
When addressing the sparse KV cache issue, we previously considered several options: supporting quantization (vLLM already implements this), implementing quantization + outlier + residual as in GEAR (not widely applicable, since it requires generating an outlier and residual matrix for every token generated, which is costly), and implementing KV cache accumulation + appendix (not widely applicable, since it requires models trained with the same method). We ultimately chose to implement partial KV cache eviction, aiming for a general, abstract framework rather than something specific to one or two approaches. Since six of the sparse KV cache methods we surveyed are based on evicting cache entries, this class of methods is well suited to being adapted into a framework and integrated into vLLM.
Sparse KV Cache Workflow
First, let's clarify the required parameters, including:
An optional flag "--sparse-kv-cache-type" indicating which sparse KV cache method to use. The default is 'auto', which does not apply any sparse KV cache; otherwise, a specific method can be selected, such as attention-score-based eviction for H2O.
A compression ratio for evicting KV cache entries: for example, 20% if we want an 80% reduction of KV cache usage. From this ratio we can derive the value of 'n' for recreating the KV cache every 'n' steps (see the sketch after this list).
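As a rough illustration of the relationship between the compression ratio and 'n', here is a minimal sketch; the function name and the exact formula are assumptions for illustration, not the PR's implementation:

```python
# Illustrative only: one plausible way to derive the recompression interval
# 'n' from the compression ratio; the exact rule used by the PR may differ.

def recompression_interval(kept_tokens: int, compression_ratio: float) -> int:
    """If a compression pass keeps `kept_tokens` entries and the cache should
    never exceed kept_tokens / compression_ratio entries, the cache can grow
    for this many decode steps (one new token per step) before the next
    compression pass is required."""
    max_tokens = round(kept_tokens / compression_ratio)
    return max(1, max_tokens - kept_tokens)


# Example: keeping 200 tokens at a 20% ratio allows the cache to grow to
# 1000 tokens, so the KV cache is recreated every 800 decode steps.
assert recompression_interval(200, 0.2) == 800
```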
The entire workflow includes:
During the first decoding pass, besides computing the KV values for all input tokens, we also calculate and retain the priority-ranking information for all tokens, such as the accumulated attention scores used by H2O.
During each scheduling pass in vLLM, we check whether 'n' steps have elapsed, which indicates that KV cache compression is needed. If so, based on the priority ranking of tokens, one or more new KV cache blocks are allocated and the input position information is updated. The block manager then copies the surviving KV entries from the sequence group's original blocks into the newly allocated blocks. Finally, the reference counts of the original KV blocks are decremented, and the original blocks may even be freed (see the sketch after this list).
New KV values then continue to be appended to the KV cache until the next compression after another 'n' steps, and this cycle repeats until generation completes.
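To make the compression step concrete, below is a minimal, self-contained sketch of H2O-style eviction on plain tensors. It is not the PR's block-manager implementation; the tensor shapes, function name, and the use of accumulated attention scores as the priority signal are assumptions for illustration.

```python
import torch


def compress_kv_cache(
    keys: torch.Tensor,              # [num_tokens, num_heads, head_dim]
    values: torch.Tensor,            # [num_tokens, num_heads, head_dim]
    accumulated_attn: torch.Tensor,  # [num_tokens] priority score per token
    compression_ratio: float,        # fraction of tokens to keep, e.g. 0.2
):
    """Keep only the highest-priority tokens and drop the rest.

    In the real framework, the kept entries would be copied by the block
    manager into freshly allocated KV blocks, the input positions would be
    rewritten, and the reference counts of the original blocks decremented
    (possibly freeing them).
    """
    num_tokens = keys.shape[0]
    num_keep = max(1, int(num_tokens * compression_ratio))
    # Pick the top-scoring tokens, then sort the indices so the surviving
    # entries stay in their original (positional) order.
    keep = torch.topk(accumulated_attn, num_keep).indices.sort().values
    return keys[keep], values[keep], keep


# Example: 1000 cached tokens compressed at a 20% ratio leaves 200 entries.
k = torch.randn(1000, 32, 128)
v = torch.randn(1000, 32, 128)
scores = torch.rand(1000)
k_new, v_new, kept = compress_kv_cache(k, v, scores, 0.2)
assert k_new.shape[0] == 200
```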
How to Run
python3 examples/sparse_kv_cache.py
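The example script is not reproduced here, but a hypothetical invocation might look like the sketch below. The `sparse_kv_cache_type` engine argument is assumed to mirror the `--sparse-kv-cache-type` flag described above and is not confirmed API; the model and method names are placeholders.

```python
# Hypothetical sketch of an offline-inference run with sparse KV cache
# enabled; `sparse_kv_cache_type` and its value "h2o" are assumptions.

from vllm import LLM, SamplingParams

llm = LLM(
    model="facebook/opt-125m",   # only OPT models are supported so far
    enforce_eager=True,          # sparse KV cache currently requires eager mode
    sparse_kv_cache_type="h2o",  # assumed name mirroring --sparse-kv-cache-type
)

sampling_params = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["The future of AI is"], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```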
Limitation
More tests will be added, and some refactoring of model support will be done after collecting feedback on the RFC.
Currently, the sparse KV cache is only supported for OPT models in eager mode, and only block_manager_v1 and the GPU worker are supported. The CPU worker, block_manager_v2, and other models will be supported in the near future.
Links
RFC link: [RFC]: Support sparse KV cache framework #5751
Design doc: https://docs.google.com/document/d/13_cpb31P9VOmPGa_tZ70s7z1vXGP_UenXf1WVuIppCk