[Prefix Cache] Use LoRA name for consistent KV-cache block hashing #27211
Conversation
Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Code Review
This pull request is a great improvement for ensuring deterministic KV-cache block hashing when using LoRA adapters. Replacing the non-deterministic integer ID with the LoRA name is a solid approach to enable reliable cache sharing across different vLLM instances. The added tests and performance profiling are appreciated and confirm the correctness and low overhead of the change.
I've added a couple of suggestions to make the implementation more robust by handling empty LoRA names, which could otherwise lead to cache collisions.
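A minimal sketch of the kind of guard being suggested, assuming a helper such as `lora_name_for_hash` (the name and placement are illustrative, not vLLM's actual API):

```python
# Illustrative sketch only: validate the LoRA name before it becomes part of
# the block-hash key, so an empty name cannot collide with the no-LoRA case.
from typing import Optional


def lora_name_for_hash(lora_name: Optional[str]) -> Optional[str]:
    """Return the LoRA component of the block-hash key, or None for base-model runs."""
    if lora_name is None:
        return None  # request runs on the base model, no extra key needed
    if not lora_name:
        # An empty string would hash like "no adapter at all" (or collide with
        # another adapter whose name is also empty), silently sharing blocks.
        raise ValueError("LoRA adapters must have a non-empty name")
    return lora_name
```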
💡 Codex Review
Here are some automated review suggestions for this pull request.
Thanks for the contribution!
Purpose
When loading a new LoRA adapter, vLLM assigns it a unique LoRA integer ID using an atomic counter (ref 1, ref 2).
This ID is then included in the KV-Cache block hash along with the block tokens and other keys.
However, since the LoRA integer ID depends on the registration order, the resulting hashes are inconsistent across runs or instances — making it impossible to deterministically identify or share KV-Cache blocks between different vLLM instances.
This PR replaces the integer ID with the LoRA name in the hash calculation, making KV-Cache hashing consistent across instances and allowing reliable cache lookups, routing, and sharing.
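A rough, self-contained sketch of the idea (the `block_hash` helper below is illustrative, not vLLM's actual implementation): the integer ID depends on registration order and so cannot be a stable hash input, while the adapter name can.

```python
# Illustrative sketch: why hashing by LoRA name is stable across instances
# while hashing by the auto-assigned integer ID is not. Not vLLM's real code.
import hashlib
from typing import Optional


def block_hash(parent_hash: bytes, token_ids: tuple, lora_key: Optional[str]) -> bytes:
    """Hash a KV-cache block from its parent hash, its tokens, and the LoRA key."""
    payload = repr((parent_hash, token_ids, lora_key)).encode()
    return hashlib.sha256(payload).digest()


tokens = tuple(range(16))  # one block of 16 tokens

# Before: two instances that registered the same adapter in a different order
# see different integer IDs (say 1 vs. 3), so the same block hashes differently.
assert block_hash(b"", tokens, "1") != block_hash(b"", tokens, "3")

# After: keying on the adapter name yields identical hashes on every instance.
assert block_hash(b"", tokens, "sql-lora") == block_hash(b"", tokens, "sql-lora")
```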
Test Plan
Test Result
All updated tests pass.
Profiling
Performance testing shows negligible overhead when using LoRA names instead of integer IDs for KV-cache block hashing.
block size: 16
num blocks: 3125
total tokens per run: 3,125 × 16 = 50,000 tokens

=== System Information ===
Platform: macOS-15.6.1-arm64-arm-64bit
Processor: arm
Python version: 3.12.11
CPU count: 10
RAM: 64.0 GB
=========================

=== LoRA Key Type Profiling Summary ===
LoRA requests processed per run: 3,125
Profiling config: 1000 runs, 3125 requests/run, block_size=16
---------------------------------------
lora_string: mean=0.0010s, std=0.0000s
  Mean time per LoRA request: 0.00000031s
lora_int: mean=0.0010s, std=0.0000s
  Mean time per LoRA request: 0.00000031s
---------------------------------------
Comparison (relative performance):
String names are 1.02x slower than int IDs
Overhead: +0.0000s per 3,125 requests (+0.00000001s per request)
=======================================

code: https://gist.github.com/sagiahrac/dfa26f54f0514fbf8e1c7a99527cfb8b
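For reference, a simplified stand-in for the benchmark (the real script is in the gist above; the loop structure below is an assumption): it hashes 3,125 blocks of 16 tokens per run with either an int-derived key or a string key and reports the mean per-run time.

```python
# Simplified micro-benchmark sketch (the actual script is in the linked gist).
import hashlib
import time

BLOCK_SIZE = 16
NUM_BLOCKS = 3125
RUNS = 100  # the gist uses 1000 runs; reduced here to keep the sketch quick


def hash_all_blocks(lora_key) -> bytes:
    parent = b""
    for i in range(NUM_BLOCKS):
        tokens = tuple(range(i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE))
        parent = hashlib.sha256(repr((parent, tokens, lora_key)).encode()).digest()
    return parent


def bench(lora_key) -> float:
    start = time.perf_counter()
    for _ in range(RUNS):
        hash_all_blocks(lora_key)
    return (time.perf_counter() - start) / RUNS


print(f"lora_int:    {bench(1):.4f}s per run")
print(f"lora_string: {bench('my-lora-adapter'):.4f}s per run")
```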