
Conversation

Contributor

@sagiahrac sagiahrac commented Oct 20, 2025

Purpose

When loading a new LoRA adapter, vLLM assigns it a unique integer LoRA ID using an atomic counter (ref 1, ref 2).
This ID is then included in the KV-cache block hash along with the block tokens and other keys.
However, because the integer ID depends on registration order, the resulting hashes are inconsistent across runs and instances, making it impossible to deterministically identify or share KV-cache blocks between different vLLM instances.

This PR replaces the integer ID with the LoRA name in the hash calculation, making KV-cache hashing consistent across instances and enabling reliable cache lookups, routing, and sharing.
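
As an illustration, here is a minimal sketch of the hashing idea (simplified, not vLLM's actual implementation; names are illustrative): each block hash folds in the parent block hash, the block's token IDs, and any extra keys such as the LoRA identifier, so swapping the per-process integer ID for the stable LoRA name makes the hash reproducible across instances.

import hashlib
import pickle
from typing import Any, Optional

def hash_block(
    parent_hash: Optional[bytes],
    token_ids: tuple[int, ...],
    extra_keys: tuple[Any, ...] = (),
) -> bytes:
    # Deterministically hash one block: the parent hash chains blocks together,
    # extra_keys carries per-request context such as the LoRA identifier.
    payload = pickle.dumps((parent_hash, token_ids, extra_keys))
    return hashlib.sha256(payload).digest()

# Before: the extra key was the per-process integer ID assigned at
# registration time, so the same adapter could hash differently across runs.
h_old = hash_block(None, (1, 2, 3), extra_keys=(7,))
# After: the extra key is the LoRA name, which is stable everywhere.
h_new = hash_block(None, (1, 2, 3), extra_keys=("my-adapter",))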

Test Plan

  • Test LoRA name inclusion: checks that when a LoRA request is active, the LoRA name (not the integer ID) appears in the extra keys used for hashing; a simplified sketch appears after this list.
  • Verified that all existing LoRA and base-model inference tests still pass.
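
A simplified, hypothetical version of the first check (function and field names are illustrative, not vLLM's exact test code):

def build_extra_keys(lora_request):
    # Mirrors the PR's change: use the stable LoRA name, not the integer ID.
    if lora_request is None:
        return ()
    return (lora_request.lora_name,)

def test_lora_name_in_extra_keys():
    class FakeLoRARequest:
        lora_name = "my-adapter"
        lora_int_id = 42

    extra_keys = build_extra_keys(FakeLoRARequest())
    assert "my-adapter" in extra_keys
    assert 42 not in extra_keys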

Test Result

All updated tests pass.

Profiling

Performance testing shows negligible overhead when using LoRA names instead of integer IDs for KV-cache block hashing.

  • block size: 16
  • num blocks: 3125
  • total tokens per run: 3,125 × 16 = 50,000 tokens
=== System Information ===
Platform: macOS-15.6.1-arm64-arm-64bit
Processor: arm
Python version: 3.12.11
CPU count: 10
RAM: 64.0 GB
=========================


=== LoRA Key Type Profiling Summary ===
LoRA requests processed per run: 3,125
Profiling config: 1000 runs, 3125 requests/run, block_size=16
---------------------------------------
lora_string: mean=0.0010s, std=0.0000s
    Mean time per LoRA request: 0.00000031s
lora_int: mean=0.0010s, std=0.0000s
    Mean time per LoRA request: 0.00000031s
---------------------------------------
Comparison (relative performance):
    String names are 1.02x slower than int IDs
    Overhead: +0.0000s per 3,125 requests (+0.00000001s per request)
=======================================

code: https://gist.github.com/sagiahrac/dfa26f54f0514fbf8e1c7a99527cfb8b
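
For reference, a rough sketch of how such a micro-benchmark could be structured; the actual profiling code is in the gist above, and the numbers reported here come from it, not from this sketch.

import hashlib
import pickle
import statistics
import time

BLOCK_SIZE = 16
NUM_BLOCKS = 3125
RUNS = 1000

def hash_blocks(extra_key):
    # Chain-hash NUM_BLOCKS blocks, folding the LoRA key into every block hash.
    parent = None
    for i in range(NUM_BLOCKS):
        tokens = tuple(range(i * BLOCK_SIZE, (i + 1) * BLOCK_SIZE))
        parent = hashlib.sha256(
            pickle.dumps((parent, tokens, (extra_key,)))
        ).digest()

def profile(extra_key):
    times = []
    for _ in range(RUNS):
        start = time.perf_counter()
        hash_blocks(extra_key)
        times.append(time.perf_counter() - start)
    return statistics.mean(times), statistics.stdev(times)

mean_str, std_str = profile("my-adapter")  # LoRA name as the extra key
mean_int, std_int = profile(7)             # integer ID as the extra key
print(f"lora_string: mean={mean_str:.4f}s, std={std_str:.4f}s")
print(f"lora_int:    mean={mean_int:.4f}s, std={std_int:.4f}s")
print(f"String names are {mean_str / mean_int:.2f}x slower than int IDs")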

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the v1 label Oct 20, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a great improvement for ensuring deterministic KV-cache block hashing when using LoRA adapters. Replacing the non-deterministic integer ID with the LoRA name is a solid approach to enable reliable cache sharing across different vLLM instances. The added tests and performance profiling are appreciated and confirm the correctness and low overhead of the change.

I've added a couple of suggestions to make the implementation more robust by handling empty LoRA names, which could otherwise lead to cache collisions.


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Member

@zhuohan123 zhuohan123 left a comment

Thanks for the contribution!

@zhuohan123 zhuohan123 enabled auto-merge (squash) October 21, 2025 00:56
@github-actions github-actions bot added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 21, 2025
@zhuohan123 zhuohan123 merged commit 1651003 into vllm-project:main Oct 22, 2025
46 checks passed
usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
…llm-project#27211)

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
845473182 pushed a commit to raindaywhu/vllm that referenced this pull request Oct 24, 2025
…o step_forward

* 'step_forward' of https://github.com/raindaywhu/vllm: (148 commits)
  [Model] Add MoE support for NemotronH (vllm-project#25863)
  [Metrics] [KVConnector] Add connector prefix cache hit rate stats (vllm-project#26245)
  [CI] Reorganize entrypoints tests (vllm-project#27403)
  add SLA information into comparison graph for vLLM Benchmark Suite (vllm-project#25525)
  [CI/Build] Fix AMD CI: test_cpu_gpu.py (vllm-project#27388)
  [Bugfix] Fix args settings for guided decoding args (vllm-project#27375)
  [CI/Build] Fix Prithvi plugin test (vllm-project#27393)
  [Chore] Remove duplicate `has_` functions in vllm.utils (vllm-project#27372)
  [Model] Add num_cached_tokens for PoolingRequestOutput (vllm-project#27378)
  [V1][spec decode] return logprobs for spec decoding (vllm-project#26060)
  [CORE] Support Prefix Caching with Prompt Embeds (vllm-project#27219)
  [Bugfix][Core] running queue index leakage exception (vllm-project#26754)
  [Bugfix] Fix incorrect kv cache metrics in grafana.json (vllm-project#27133)
  [Bugfix] Fix SLA tuner initialization (vllm-project#27355)
  [Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_field_config=True` with tensor schema support (vllm-project#27361)
  [MLA] Bump FlashMLA (vllm-project#27354)
  [Chore] Separate out system utilities from vllm.utils (vllm-project#27201)
  [BugFix] bugfix for Flash Attention MLA with full cuda graph IMA following pr-25490 (vllm-project#27128)
  [Feature] publisher default set zmq in kv_event config (vllm-project#26915)
  [Prefix Cache] Use LoRA name for consistent KV-cache block hashing (vllm-project#27211)
  ...
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Oct 25, 2025
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#27211)

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#27211)

Signed-off-by: Sage Ahrac <sagiahrak@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
