Conversation

@orozery
Contributor

@orozery orozery commented Aug 10, 2025

This PR adds an offloading connector that delegates to a generic API introduced in #19848.
The actual implementation of this API is constructed via a factory, which is currently empty (a rough sketch of the pattern follows below).
A small follow-up PR will register a CPU implementation based on #20075 (scheduler-side implementation) and #21448 (worker-side implementation).

Part of RFC #19854.
Depends on PRs #19728, #19848, #19737.
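For orientation, the factory pattern described above might look roughly like this (a minimal sketch; the class and method names are illustrative, and the real identifiers live in the linked PRs):

```python
from typing import Any, Callable

class OffloadingBackendFactory:
    """Hypothetical registry mapping backend names to constructors."""

    _registry: dict[str, Callable[..., Any]] = {}

    @classmethod
    def register(cls, name: str, ctor: Callable[..., Any]) -> None:
        cls._registry[name] = ctor

    @classmethod
    def create(cls, name: str, **kwargs: Any) -> Any:
        if name not in cls._registry:
            raise ValueError(f"unknown offloading backend: {name!r}")
        return cls._registry[name](**kwargs)

# The follow-up PR would then register the CPU implementation, e.g.:
# OffloadingBackendFactory.register("cpu", CpuOffloadingBackend)
```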

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces a new offloading connector. The implementation is extensive and adds many new components. My review found several critical issues that need to be addressed: a race condition in the tests, a critical assertion that would crash workers on transfer failures, a resource leak due to unjoined threads, and an incorrect list slice that would lead to errors. These issues affect both the correctness of the new feature and the reliability of its tests.

@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from 4b24d03 to 4fca175 on August 10, 2025 14:49
@KuntaiDu
Collaborator

Marking this; I'll take a look and review once the PR stabilizes.

@orozery orozery force-pushed the offloading-connector branch from 4fca175 to 8d7a0d7 on August 11, 2025 13:43
@mergify mergify bot added the documentation label (Improvements or additions to documentation) Aug 11, 2025
@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from 866a51c to 4872976 on August 11, 2025 19:11
@mergify

mergify bot commented Aug 14, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 14, 2025
@orozery orozery force-pushed the offloading-connector branch from 4872976 to 9d2e0b9 on August 14, 2025 12:40
@mergify mergify bot removed the needs-rebase label Aug 14, 2025
@orozery orozery force-pushed the offloading-connector branch from 9d2e0b9 to 11e1629 on August 14, 2025 12:50
@ApostaC
Collaborator

ApostaC commented Aug 19, 2025

@orozery Hey, thanks for the amazing work!

Is there a centralized branch for us to run some benchmarks? We are excited to test it and would like to push for landing this connector-based CPU offloading solution if it has good performance 🚀.

@ZrBac

ZrBac commented Aug 22, 2025

> @orozery Hey, thanks for the amazing work!
>
> Is there a centralized branch for us to run some benchmarks? We are excited to test it and would like to push for landing this connector-based CPU offloading solution if it has good performance 🚀.

Try this branch: https://github.com/orozery/vllm/tree/cpu-offloading-afa5b7

Collaborator

@ApostaC ApostaC left a comment


@orozery Thanks for the great effort! Some high-level comments:

  1. The current implementation is a bit over-complicated. We should simplify the transfer_fn and the LoadStoreSpec abstraction to get better performance and maintainability.
  2. There are a few potential performance improvements we can make, immediately or as follow-ups (see the sketch below):
    (a) Launch the d2h/h2d copy kernels in a separate CUDA stream.
    (b) Use CUDA events to implement the async loading, so that we don't need to launch extra Python threads in the worker process.
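For concreteness, here is a minimal sketch of what (a) and (b) could look like, assuming PyTorch's stream/event APIs; start_offload, report_finished, and the tensor layout are placeholders, not this PR's actual code:

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for d2h/h2d copies

def start_offload(gpu_blocks: torch.Tensor,
                  cpu_blocks: torch.Tensor) -> torch.cuda.Event:
    """Launch an async GPU->CPU copy and return an event to poll later."""
    done = torch.cuda.Event()
    # Make sure the copy sees the KV data produced on the default stream.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        # non_blocking=True overlaps with compute; cpu_blocks must be pinned.
        cpu_blocks.copy_(gpu_blocks, non_blocking=True)
        done.record(copy_stream)
    return done

# The worker can then poll the event instead of joining a Python thread:
#   if event.query():  # True once the copy has completed on the GPU
#       report_finished(req_id)
```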

@orozery
Contributor Author

orozery commented Sep 2, 2025

@ApostaC regarding performance:
With my newest changes (not yet pushed), running @KuntaiDu's benchmark mentioned here:

Building LLM with LMCache: False with IBM: True
--------------------------------------------------
Batch size: 8
Average time for put: 7.0790 seconds
[7.0743, 7.1795, 7.0148, 7.1048, 7.0217]
--------------------------------------------------
Batch size: 8
Average time for get: 2.5238 seconds
[2.4561, 2.4778, 2.4658, 2.7351, 2.4842]


Building LLM with LMCache: True with IBM: False
--------------------------------------------------
Batch size: 8
Average time for put: 7.6079 seconds
[7.5944, 7.61, 7.6107, 7.6105, 7.6138]
--------------------------------------------------
Batch size: 8
Average time for get: 0.6232 seconds
[0.6272, 0.6199, 0.6224, 0.6243, 0.6223]


Building LLM with LMCache: False with IBM: False
--------------------------------------------------
Batch size: 8
Average time for put: 7.0416 seconds
[6.9824, 6.9939, 6.9981, 7.0069, 7.2269]
--------------------------------------------------
Batch size: 8
Average time for get: 7.0635 seconds
[7.0195, 7.032, 7.032, 7.0202, 7.214]

Adding some logging, I see that the times reported by the offloading connector are completely dominated by the running time of the CUDA transfer function swap_blocks.
The swap_blocks function uses cudaMemcpyAsync, which utilizes the GPU's DMA engine.
LMCache's transfer function, on the other hand, is a multi-threaded kernel which heavily utilizes the GPU compute cores.
This is perfect for the sync case where the GPU is idle, but I'm not sure the same kind of transfer function will be best for our case, where model computations run in parallel on the GPU.
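To make the comparison concrete, a hedged sketch of the two copy styles using PyTorch stand-ins (vLLM's swap_blocks and LMCache's kernel are the real implementations; gpu_cache and block_ids here are placeholders):

```python
import torch

gpu_cache = torch.randn(1024, 16, 128, device="cuda")
cpu_cache = torch.empty(gpu_cache.shape, dtype=gpu_cache.dtype,
                        pin_memory=True)
block_ids = [3, 7, 42]

# Style 1 (swap_blocks-like): per-block cudaMemcpyAsync via the copy/DMA
# engine, leaving the SMs free for model compute running in parallel.
for b in block_ids:
    cpu_cache[b].copy_(gpu_cache[b], non_blocking=True)

# Style 2 (LMCache-like): a gather kernel packs the blocks first. One big
# contiguous copy afterwards, but the gather itself occupies compute cores.
idx = torch.tensor(block_ids, device="cuda")
packed = gpu_cache.index_select(0, idx)  # runs on the SMs
packed_cpu = packed.cpu()
torch.cuda.synchronize()                 # wait for the async copies too
```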
Your thoughts?

Two other thoughts:

  1. I think the other suggested optimizations (like using CUDA events or changing LoadStoreSpec) will be negligible given the performance weight of the transfer function. Nevertheless, as I wrote above, I'm open to making these improvements.
  2. We should test with higher QPS to check whether we can actually see performance benefits from async offloading.

@pytorch-bot

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from ef5a8bd to cfad5c9 on September 16, 2025 09:28
@orozery
Contributor Author

orozery commented Sep 16, 2025

@njhill I moved all logging to debug.
Still, I think it is useful to have a level that only logs block hits and block-offloading summaries.
Currently, at the debug level I have much more verbose logging that actually prints the load-store specs.
So if you turn on debug, you get a lot of logs (for actual debugging).
The problem is that I don't see any other level I can use (for example, if there were a trace level, I could move my "real" debug-level logs to trace).

@orozery orozery force-pushed the offloading-connector branch from cfad5c9 to ed73f94 on September 17, 2025 09:33
@njhill
Member

njhill commented Sep 17, 2025

> @njhill I moved all logging to debug. Still, I think it is useful to have a level that only logs block hits and block-offloading summaries. Currently, at the debug level I have much more verbose logging that actually prints the load-store specs. So if you turn on debug, you get a lot of logs (for actual debugging). The problem is that I don't see any other level I can use (for example, if there were a trace level, I could move my "real" debug-level logs to trace).

@orozery right, I know what you mean. While it might not be the best approach, what we've done in other cases is expose env vars to control specific types of logs. So perhaps we could do this when VLLM_LOG_KV_OFFLOAD_STATS=1 is set?
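As a rough illustration of that suggestion (VLLM_LOG_KV_OFFLOAD_STATS is only the name floated in this thread, and log_offload_summary is a hypothetical helper):

```python
import logging
import os

logger = logging.getLogger("vllm.kv_offload")

# Summary-level stats are gated on the proposed env var; the verbose
# load-store-spec dumps stay at DEBUG regardless.
LOG_OFFLOAD_STATS = os.getenv("VLLM_LOG_KV_OFFLOAD_STATS", "0") == "1"

def log_offload_summary(num_block_hits: int, num_blocks_offloaded: int) -> None:
    if LOG_OFFLOAD_STATS:
        logger.info("KV offload: %d block hits, %d blocks offloaded",
                    num_block_hits, num_blocks_offloaded)
```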

@orozery
Contributor Author

orozery commented Sep 18, 2025

> @orozery right, I know what you mean. While it might not be the best approach, what we've done in other cases is expose env vars to control specific types of logs. So perhaps we could do this when VLLM_LOG_KV_OFFLOAD_STATS=1 is set?

I'd prefer not to introduce a new env variable.
I think the better way to do this is actually not via logging but via some metrics-collection API.
Let's defer this for later.
BTW, I think this is the only one of my PRs that you haven't approved so far.
PR #24251 (which was approved) actually depends on this one, so it should not be merged before this one.

@mergify mergify bot added the kv-connector label Sep 18, 2025
@njhill
Member

njhill commented Sep 18, 2025

> I'd prefer not to introduce a new env variable.

It could also be an option in the KV-connector-specific config, I guess. Sounds good to defer, though.
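Sketched out, the config-based variant might look like this (assuming a dict-like extra-config field on the KV transfer config; the log_offload_stats key is made up for illustration):

```python
# Hedged sketch: read the flag from the connector-specific config rather
# than an env var. "log_offload_stats" is a hypothetical key.
def stats_logging_enabled(kv_transfer_config) -> bool:
    extra = getattr(kv_transfer_config, "kv_connector_extra_config", None) or {}
    return bool(extra.get("log_offload_stats", False))
```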

@njhill njhill changed the title from "v1: Offloading connector" to "[KV offload][4/N] Offloading KV connector" Sep 18, 2025
This commit introduces a new OffloadingConnector for
offloading blocks of KV data via a generic interface.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the offloading-connector branch from ed73f94 to 7685c3b on September 18, 2025 21:10
@njhill njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 19, 2025
Collaborator

@ApostaC ApostaC left a comment


LGTM!

@njhill njhill enabled auto-merge (squash) September 19, 2025 17:05
@njhill njhill merged commit c59a0ec into vllm-project:main Sep 19, 2025
56 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
self._request_block_ids.pop(req_id, None)
self._next_stored_block_idx.pop(req_id, None)

request_being_stored = req_id in self._reqs_being_stored
Contributor


There is a lag in determining a request's saved/sent state. During local debugging, I found that when a request finishes and its save operation has also completed, this method cannot immediately determine that the request has finished. As a result, the blocks the scheduler holds for this request in the current cycle are not released immediately; they are only released in the next cycle via KVConnectorOutput's finished_sending.

Collaborator


I think this is actually expected. There would be a memory leak if get_finished returned the request ID earlier than the scheduler calls request_finished.
This is because the scheduler expects the finished request ID in order to free the GPU KV cache after it calls request_finished. If the request ID is returned earlier, the GPU KV cache will never be freed.
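In pseudocode, the invariant being described is roughly this (illustrative only, not vLLM's actual scheduler code):

```python
# Blocks may only be freed once BOTH sides have reported:
#   1. the scheduler has called request_finished() for the request, and
#   2. the worker has returned the request ID via get_finished()
#      (surfaced to the scheduler as finished_sending).
def maybe_free_blocks(req_id, request_finished_called, finished_sending,
                      block_pool):
    if req_id in request_finished_called and req_id in finished_sending:
        block_pool.free(req_id)

# If the worker reports req_id *before* request_finished() runs and that
# report is dropped, nothing ever frees the blocks -- the leak noted above.
```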

Contributor


You can take a look at my attempt in PR #27248. If I switch the connector to the Offloading Connector, the reset_prefix_cache method fails to take effect because the KV blocks from the first request have not been released yet. The resources from the first request are only freed after the second request arrives and executes once.

Processed prompts: 100%|███| 1/1 [00:00<00:00,  2.19it/s, est. speed input: 13182.61 toks/s, output: 21.95 toks/s]
--------------------------------------------------
Generated text: '...Hello, my name is...Hello, my'
Generation took 0.48 seconds, first request done.
--------------------------------------------------
WARNING 10-26 18:11:40 [v1/core/block_pool.py:366] Failed to reset prefix cache because some blocks (376) are not freed yet
Adding requests: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.85it/s]
Processed prompts:   0%|                | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]DEBUG 10-26 18:11:40 [v1/core/sched/scheduler.py:1291] Finished sending KV transfer for request 0
Processed prompts: 100%|███| 1/1 [00:00<00:00,  5.13it/s, est. speed input: 30844.40 toks/s, output: 51.35 toks/s]
--------------------------------------------------
Generated text: ' about a person who is a doctor. I need'
Generation took 0.21 seconds, second request done.
--------------------------------------------------

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ci/build, documentation, kv-connector, ready, v1
