Conversation

@orozery
Contributor

@orozery orozery commented Aug 10, 2025

This PR adds an offloading connector that delegates to a generic API introduced in #19848.
The actual implementation of this API is constructed via a factory, which is currently empty (a rough sketch of the pattern follows below).
A small follow-up PR will register a CPU implementation based on #20075 (scheduler-side implementation) and #21448 (worker-side implementation).

Part of RFC #19854.
Depends on PRs #19728, #19848, #19737.
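For orientation, the factory pattern described above might look roughly like this (a minimal sketch; the class and method names are illustrative, and the real identifiers live in the linked PRs):

```python
from typing import Any, Callable

class OffloadingBackendFactory:
    """Hypothetical registry mapping backend names to constructors."""

    _registry: dict[str, Callable[..., Any]] = {}

    @classmethod
    def register(cls, name: str, ctor: Callable[..., Any]) -> None:
        cls._registry[name] = ctor

    @classmethod
    def create(cls, name: str, **kwargs: Any) -> Any:
        if name not in cls._registry:
            raise ValueError(f"unknown offloading backend: {name!r}")
        return cls._registry[name](**kwargs)

# The follow-up PR would then register the CPU implementation, e.g.:
# OffloadingBackendFactory.register("cpu", CpuOffloadingBackend)
```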

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs will not trigger a full CI run by default. Instead, only the fastcheck CI will run, covering a small, essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR introduces a new offloading connector. The implementation is extensive and adds many new components. My review found several critical issues that need to be addressed: a race condition in the tests, a critical assertion that would crash workers on transfer failures, a resource leak due to unjoined threads, and an incorrect list slice that would lead to errors. These issues affect both the correctness of the new feature and the reliability of its tests.

@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from 4b24d03 to 4fca175 on August 10, 2025 14:49
@KuntaiDu
Collaborator

Marking this; I'll take a look and review once the PR stabilizes.

@orozery orozery force-pushed the offloading-connector branch from 4fca175 to 8d7a0d7 on August 11, 2025 13:43
@mergify mergify bot added the documentation label (Improvements or additions to documentation) Aug 11, 2025
@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from 866a51c to 4872976 on August 11, 2025 19:11
@mergify

mergify bot commented Aug 14, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @orozery.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 14, 2025
@orozery orozery force-pushed the offloading-connector branch from 4872976 to 9d2e0b9 on August 14, 2025 12:40
@mergify mergify bot removed the needs-rebase label Aug 14, 2025
@orozery orozery force-pushed the offloading-connector branch from 9d2e0b9 to 11e1629 on August 14, 2025 12:50
@ApostaC
Collaborator

ApostaC commented Aug 19, 2025

@orozery Hey, thanks for the amazing work!

Is there a centralized branch for us to run some benchmarks? We are excited to test it and would like to push for landing this connector-based CPU offloading solution if it has good performance 🚀.

@ZrBac

ZrBac commented Aug 22, 2025

> @orozery Hey, thanks for the amazing work!
>
> Is there a centralized branch for us to run some benchmarks? We are excited to test it and would like to push for landing this connector-based CPU offloading solution if it has good performance 🚀.

Try this branch: https://github.com/orozery/vllm/tree/cpu-offloading-afa5b7

Collaborator

@ApostaC ApostaC left a comment


@orozery Thanks for the great effort! Some high-level comments:

  1. The current implementation is a bit over-complicated. We should simplify the transfer_fn and the LoadStoreSpec abstraction to get better performance and maintainability.
  2. There are a few potential performance improvements we can make, immediately or as follow-ups (see the sketch below):
    (a) Launch the d2h/h2d copy kernels in a separate CUDA stream.
    (b) Use CUDA events to implement the async loading, so that we don't need to launch extra Python threads in the worker process.
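For concreteness, here is a minimal sketch of what (a) and (b) could look like, assuming PyTorch's stream/event APIs; start_offload, report_finished, and the tensor layout are placeholders, not this PR's actual code:

```python
import torch

copy_stream = torch.cuda.Stream()  # dedicated stream for d2h/h2d copies

def start_offload(gpu_blocks: torch.Tensor,
                  cpu_blocks: torch.Tensor) -> torch.cuda.Event:
    """Launch an async GPU->CPU copy and return an event to poll later."""
    done = torch.cuda.Event()
    # Make sure the copy sees the KV data produced on the default stream.
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        # non_blocking=True overlaps with compute; cpu_blocks must be pinned.
        cpu_blocks.copy_(gpu_blocks, non_blocking=True)
        done.record(copy_stream)
    return done

# The worker can then poll the event instead of joining a Python thread:
#   if event.query():  # True once the copy has completed on the GPU
#       report_finished(req_id)
```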

@orozery
Contributor Author

orozery commented Sep 2, 2025

@ApostaC regarding performance:
With my newest changes (not yet pushed), running @KuntaiDu's benchmark mentioned here:

Building LLM with LMCache: False with IBM: True
--------------------------------------------------
Batch size: 8
Average time for put: 7.0790 seconds
[7.0743, 7.1795, 7.0148, 7.1048, 7.0217]
--------------------------------------------------
Batch size: 8
Average time for get: 2.5238 seconds
[2.4561, 2.4778, 2.4658, 2.7351, 2.4842]


Building LLM with LMCache: True with IBM: False
--------------------------------------------------
Batch size: 8
Average time for put: 7.6079 seconds
[7.5944, 7.61, 7.6107, 7.6105, 7.6138]
--------------------------------------------------
Batch size: 8
Average time for get: 0.6232 seconds
[0.6272, 0.6199, 0.6224, 0.6243, 0.6223]


Building LLM with LMCache: False with IBM: False
--------------------------------------------------
Batch size: 8
Average time for put: 7.0416 seconds
[6.9824, 6.9939, 6.9981, 7.0069, 7.2269]
--------------------------------------------------
Batch size: 8
Average time for get: 7.0635 seconds
[7.0195, 7.032, 7.032, 7.0202, 7.214]

Adding some logging, I see that the times reported by the offloading connector are completely dominated by the running time of the CUDA transfer function swap_blocks.
The swap_blocks function uses cudaMemcpyAsync, which utilizes the GPU's DMA engine.
LMCache's transfer function, on the other hand, is a multi-threaded kernel which heavily utilizes the GPU compute cores.
This is perfect for the sync case where the GPU is idle, but I'm not sure the same kind of transfer function will be best for our case, where model computations run in parallel on the GPU.
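To make the comparison concrete, a hedged sketch of the two copy styles using PyTorch stand-ins (vLLM's swap_blocks and LMCache's kernel are the real implementations; gpu_cache and block_ids here are placeholders):

```python
import torch

gpu_cache = torch.randn(1024, 16, 128, device="cuda")
cpu_cache = torch.empty(gpu_cache.shape, dtype=gpu_cache.dtype,
                        pin_memory=True)
block_ids = [3, 7, 42]

# Style 1 (swap_blocks-like): per-block cudaMemcpyAsync via the copy/DMA
# engine, leaving the SMs free for model compute running in parallel.
for b in block_ids:
    cpu_cache[b].copy_(gpu_cache[b], non_blocking=True)

# Style 2 (LMCache-like): a gather kernel packs the blocks first. One big
# contiguous copy afterwards, but the gather itself occupies compute cores.
idx = torch.tensor(block_ids, device="cuda")
packed = gpu_cache.index_select(0, idx)  # runs on the SMs
packed_cpu = packed.cpu()
torch.cuda.synchronize()                 # wait for the async copies too
```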
Your thoughts?

Two other thoughts:

  1. I think the other suggested optimizations (like using CUDA events or changing LoadStoreSpec) will be negligible given the performance weight of the transfer function. Nevertheless, as I wrote above, I'm open to making these improvements.
  2. We should test with higher QPS to check whether we can actually see performance benefits from async offloading.

@pytorch-bot

pytorch-bot bot commented Sep 15, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

@orozery orozery force-pushed the offloading-connector branch 2 times, most recently from ef5a8bd to cfad5c9 on September 16, 2025 09:28
@orozery
Contributor Author

orozery commented Sep 16, 2025

@njhill I moved all logging to debug.
Still, I think it is useful to have a level that only logs block hits and block-offloading summaries.
Currently, at the debug level I have much more verbose logging that actually prints the load-store specs.
So if you turn on debug, you get a lot of logs (for actual debugging).
The problem is that I don't see any other level I can use (for example, if there were a trace level, I could move my "real" debug-level logs to trace).

@orozery orozery force-pushed the offloading-connector branch from cfad5c9 to ed73f94 on September 17, 2025 09:33
@njhill
Member

njhill commented Sep 17, 2025

> @njhill I moved all logging to debug. Still, I think it is useful to have a level that only logs block hits and block-offloading summaries. Currently, at the debug level I have much more verbose logging that actually prints the load-store specs. So if you turn on debug, you get a lot of logs (for actual debugging). The problem is that I don't see any other level I can use (for example, if there were a trace level, I could move my "real" debug-level logs to trace).

@orozery right, I know what you mean. While it might not be the best approach, what we've done in other cases is expose env vars to control specific types of logs. So perhaps we could do this when VLLM_LOG_KV_OFFLOAD_STATS=1 is set?
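As a rough illustration of that suggestion (VLLM_LOG_KV_OFFLOAD_STATS is only the name floated in this thread, and log_offload_summary is a hypothetical helper):

```python
import logging
import os

logger = logging.getLogger("vllm.kv_offload")

# Summary-level stats are gated on the proposed env var; the verbose
# load-store-spec dumps stay at DEBUG regardless.
LOG_OFFLOAD_STATS = os.getenv("VLLM_LOG_KV_OFFLOAD_STATS", "0") == "1"

def log_offload_summary(num_block_hits: int, num_blocks_offloaded: int) -> None:
    if LOG_OFFLOAD_STATS:
        logger.info("KV offload: %d block hits, %d blocks offloaded",
                    num_block_hits, num_blocks_offloaded)
```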

@orozery
Contributor Author

orozery commented Sep 18, 2025

> @orozery right, I know what you mean. While it might not be the best approach, what we've done in other cases is expose env vars to control specific types of logs. So perhaps we could do this when VLLM_LOG_KV_OFFLOAD_STATS=1 is set?

I'd prefer not to introduce a new env variable.
I think the better way to do this is actually not via logging but via some metrics-collection API.
Let's defer this for later.
BTW, I think this is the only one of my PRs that you haven't approved so far.
PR #24251 (which was approved) actually depends on this one, so it should not be merged before this one.

@mergify mergify bot added the kv-connector label Sep 18, 2025
@njhill
Member

njhill commented Sep 18, 2025

> I'd prefer not to introduce a new env variable.

It could also be an option in the KV-connector-specific config, I guess. Sounds good to defer, though.
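Sketched out, the config-based variant might look like this (assuming a dict-like extra-config field on the KV transfer config; the log_offload_stats key is made up for illustration):

```python
# Hedged sketch: read the flag from the connector-specific config rather
# than an env var. "log_offload_stats" is a hypothetical key.
def stats_logging_enabled(kv_transfer_config) -> bool:
    extra = getattr(kv_transfer_config, "kv_connector_extra_config", None) or {}
    return bool(extra.get("log_offload_stats", False))
```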

@njhill njhill changed the title from "v1: Offloading connector" to "[KV offload][4/N] Offloading KV connector" Sep 18, 2025
This commit introduces a new OffloadingConnector for
offloading blocks of KV data via a generic interface.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@orozery orozery force-pushed the offloading-connector branch from ed73f94 to 7685c3b on September 18, 2025 21:10
@njhill njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) Sep 19, 2025
Collaborator

@ApostaC ApostaC left a comment


LGTM!

@njhill njhill enabled auto-merge (squash) September 19, 2025 17:05
@njhill njhill merged commit c59a0ec into vllm-project:main Sep 19, 2025
56 checks passed
debroy-rh pushed a commit to debroy-rh/vllm that referenced this pull request Sep 19, 2025
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
charlifu pushed a commit to ROCm/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: charlifu <charlifu@amd.com>
yewentao256 pushed a commit that referenced this pull request Oct 3, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
choprahetarth pushed a commit to Tandemn-Labs/vllm that referenced this pull request Oct 11, 2025
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
self._request_block_ids.pop(req_id, None)
self._next_stored_block_idx.pop(req_id, None)

request_being_stored = req_id in self._reqs_being_stored
Contributor


There is a lag in determining a request's saved/sent state. During local debugging, I found that when a request finishes and its save operation has also completed, this method cannot immediately determine that the request has finished. As a result, the blocks the scheduler holds for this request in the current cycle are not released immediately; they are only released in the next cycle via KVConnectorOutput's finished_sending.

Collaborator


I think this is actually expected. There would be a memory leak if get_finished returned the request ID earlier than the scheduler calls request_finished.
This is because the scheduler expects the finished request ID in order to free the GPU KV cache after it calls request_finished. If the request ID is returned earlier, the GPU KV cache will never be freed.
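In pseudocode, the invariant being described is roughly this (illustrative only, not vLLM's actual scheduler code):

```python
# Blocks may only be freed once BOTH sides have reported:
#   1. the scheduler has called request_finished() for the request, and
#   2. the worker has returned the request ID via get_finished()
#      (surfaced to the scheduler as finished_sending).
def maybe_free_blocks(req_id, request_finished_called, finished_sending,
                      block_pool):
    if req_id in request_finished_called and req_id in finished_sending:
        block_pool.free(req_id)

# If the worker reports req_id *before* request_finished() runs and that
# report is dropped, nothing ever frees the blocks -- the leak noted above.
```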

Contributor


You can take a look at my attempt in PR #27248. If I switch the connector to the Offloading Connector, the reset_prefix_cache method fails to take effect because the KV blocks from the first request have not been released yet. The resources from the first request are only freed after the second request arrives and executes once.

Processed prompts: 100%|███| 1/1 [00:00<00:00,  2.19it/s, est. speed input: 13182.61 toks/s, output: 21.95 toks/s]
--------------------------------------------------
Generated text: '...Hello, my name is...Hello, my'
Generation took 0.48 seconds, first request done.
--------------------------------------------------
WARNING 10-26 18:11:40 [v1/core/block_pool.py:366] Failed to reset prefix cache because some blocks (376) are not freed yet
Adding requests: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.85it/s]
Processed prompts:   0%|                | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]DEBUG 10-26 18:11:40 [v1/core/sched/scheduler.py:1291] Finished sending KV transfer for request 0
Processed prompts: 100%|███| 1/1 [00:00<00:00,  5.13it/s, est. speed input: 30844.40 toks/s, output: 51.35 toks/s]
--------------------------------------------------
Generated text: ' about a person who is a doctor. I need'
Generation took 0.21 seconds, second request done.
--------------------------------------------------

xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>

Labels

ci/build, documentation, kv-connector, ready, v1
