KVCache Manager

Introduction

LLM inference can be computationally expensive due to the sequential nature of token generation. KV-caching plays a critical role in optimizing this process. By storing previously computed key and value attention vectors, KVCache reuse avoids redundant computation during inference, reducing latency and resource consumption. This is particularly beneficial for long-context, multi-turn conversations and for agentic (and RAG) applications, where previously computed information can be reused. Efficient KVCache management and routing are essential for scaling LLM inference and delivering a responsive user experience.

llm-d-kv-cache-manager is a pluggable KVCache Manager for KVCache-aware routing in vLLM-based serving platforms.

See docs for more information on goals, architecture and more.

Overview

The code defines a KVCacheIndexer module that efficiently maintains a global view of KVCache states and localities. In the current state of vLLM, the only information available on KVCache availability covers the tensors offloaded to KVCache Engines via the Connector API.

The kvcache.Indexer module is a pluggable Go package designed for use by orchestrators to enable KVCache-aware scheduling decisions.

graph 
  subgraph Cluster
    Router
    subgraph KVCacheManager[KVCache Manager]
      KVCacheIndexer[KVCache Indexer]
      PrefixStore[LRU Prefix Store]
      KVBlockToPodIndex[KVBlock to Pod availability Index]
    end
    subgraph vLLMNode[vLLM Node]
      vLLMCore[vLLM Core]
      KVCacheEngine["KVCache Engine (LMCache)"]
    end
    Redis
  end

  Router -->|"Score(prompt, ModelName, relevantPods)"| KVCacheIndexer
  KVCacheIndexer -->|"{Pod to Scores map}"| Router
  Router -->|Route| vLLMNode
  
  KVCacheIndexer -->|"FindLongestTokenizedPrefix(prompt, ModelName) -> tokens"| PrefixStore
  PrefixStore -->|"DigestPromptAsync"| PrefixStore
  KVCacheIndexer -->|"GetPodsForKeys(tokens) -> {KVBlock keys to Pods} availability map"| KVBlockToPodIndex
  KVBlockToPodIndex -->|"Redis MGet(blockKeys) -> {KVBlock keys to Pods}"| Redis

  vLLMCore -->|Connector API| KVCacheEngine
  KVCacheEngine -->|"UpdateIndex(KVBlock keys, nodeIP)"| Redis

This overview greatly simplifies the actual architecture and combines steps across several submodules. For a detailed architecture, refer to the architecture document.
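
To make the flow in the diagram more concrete, here is a minimal sketch of the lookup and scoring steps. It is not the package's actual code: the in-memory `blockIndex` stands in for the Redis-backed index, and `blockSize`, `blockKeys`, `update`, `score`, the chained SHA-256 keys, and the "consecutive blocks from the start of the prompt" scoring rule are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// blockSize is an assumed number of tokens per KV block (illustrative only).
const blockSize = 4

// blockIndex is a hypothetical in-memory stand-in for the Redis-backed
// KVBlock-to-Pod index: block key -> set of pods holding that block.
type blockIndex map[string]map[string]struct{}

// blockKeys chains a hash over each full block of tokens, so a key identifies
// the whole prefix ending at that block. A trailing partial block is skipped.
func blockKeys(model string, tokens []int) []string {
	var keys []string
	prev := model
	for i := 0; i+blockSize <= len(tokens); i += blockSize {
		h := sha256.New()
		h.Write([]byte(prev))
		for _, t := range tokens[i : i+blockSize] {
			h.Write([]byte(strconv.Itoa(t) + ","))
		}
		prev = hex.EncodeToString(h.Sum(nil))
		keys = append(keys, prev)
	}
	return keys
}

// update records that pod holds the given blocks (the write path in the diagram).
func (idx blockIndex) update(keys []string, pod string) {
	for _, k := range keys {
		if idx[k] == nil {
			idx[k] = map[string]struct{}{}
		}
		idx[k][pod] = struct{}{}
	}
}

// score counts, for each candidate pod, how many consecutive blocks from the
// start of the prompt that pod holds (an assumed scoring rule).
func (idx blockIndex) score(keys []string, pods []string) map[string]int {
	scores := make(map[string]int, len(pods))
	for _, pod := range pods {
		n := 0
		for _, k := range keys {
			if _, ok := idx[k][pod]; !ok {
				break
			}
			n++
		}
		scores[pod] = n
	}
	return scores
}

func main() {
	idx := blockIndex{}
	prompt := []int{1, 2, 3, 4, 5, 6, 7, 8, 9}
	keys := blockKeys("model-a", prompt) // two full blocks; the ninth token is skipped
	idx.update(keys, "pod-1")
	idx.update(keys[:1], "pod-2")
	fmt.Println(idx.score(keys, []string{"pod-1", "pod-2", "pod-3"}))
	// prints map[pod-1:2 pod-2:1 pod-3:0]
}
```

Chaining each key to the previous one means a block only matches when the entire prefix before it matches too, which is why the score can stop at the first missing block.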

Examples

KVCache Indexer:
- A reference implementation of using the kvcache.Indexer module.
KVCache Aware Scorer:
- A reference implementation of integrating the kvcache.Indexer module into llm-d-inference-scheduler as a KVCache-aware scorer.
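
As a rough illustration of what a KVCache-aware scorer might do with the pod-to-score map returned by the indexer, the sketch below normalizes raw scores into the range [0, 1]. The `normalize` function and the max-based normalization are assumptions made for this example and are not taken from the reference implementations listed above.

```go
package main

import "fmt"

// normalize scales raw per-pod scores into [0, 1] by dividing by the highest
// score. Pods missing from raw (no cached blocks reported) score 0.
func normalize(raw map[string]int, pods []string) map[string]float64 {
	best := 0
	for _, p := range pods {
		if raw[p] > best {
			best = raw[p]
		}
	}
	out := make(map[string]float64, len(pods))
	for _, p := range pods {
		if best == 0 {
			out[p] = 0 // no pod has any cached prefix; avoid dividing by zero
			continue
		}
		out[p] = float64(raw[p]) / float64(best)
	}
	return out
}

func main() {
	raw := map[string]int{"pod-1": 6, "pod-2": 3}
	fmt.Println(normalize(raw, []string{"pod-1", "pod-2", "pod-3"}))
	// prints map[pod-1:1 pod-2:0.5 pod-3:0]
}
```

A normalized score of this kind could then be combined with other scheduling signals, such as load, before a pod is chosen.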
