[KV offload][4/N] Offloading KV connector #22595
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Code Review
This PR introduces a new offloading connector. The implementation is extensive and adds a lot of new components. My review found several critical issues that need to be addressed. These include a race condition in the tests, a critical assertion that would crash workers on transfer failures, a resource leak due to unjoined threads, and an incorrect list slicing that would lead to errors. These issues affect both the correctness of the new feature and the reliability of its tests.
Marking; will take a look and review after this PR gets stable.
This pull request has merge conflicts that must be resolved before it can be merged.
@orozery Hey, thanks for the amazing work! Is there a centralized branch for us to run some benchmarks against? We are excited to test it and would like to push for landing this connector-based CPU offloading solution if it has good performance 🚀.
Try this branch: https://github.com/orozery/vllm/tree/cpu-offloading-afa5b7
@orozery Thanks for the great effort! Some high-level comments:
- The current implementation is a bit over-complicated. We should simplify the `transfer_fn` and the `LoadStoreSpec` abstraction in order to get better performance and better maintainability.
- There are a few potential performance improvements that we can make (immediately or as potential follow-ups); a sketch follows below.
  (a) Launch the d2h/h2d copy kernels in a separate CUDA stream.
  (b) Use CUDA events to implement the async loading so that we don't need to launch extra Python threads in the worker process.
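A minimal sketch of what (a) and (b) could look like with plain PyTorch primitives (not the PR's actual implementation; `offload_stream`, `start_offload`, and `poll_finished` are illustrative names): the d2h copy is enqueued on a dedicated stream, and completion is detected by polling a CUDA event rather than blocking an extra Python thread in the worker process.

```python
# Sketch only: offload GPU KV blocks to pinned CPU buffers on a side stream,
# and poll a CUDA event to detect completion without a helper thread.
import torch

offload_stream = torch.cuda.Stream()              # dedicated stream for d2h/h2d copies
pending: list[tuple[torch.cuda.Event, str]] = []  # (event, request_id) pairs in flight

def start_offload(gpu_blocks: torch.Tensor, cpu_buf: torch.Tensor, req_id: str) -> None:
    """Launch an async device-to-host copy; returns immediately."""
    assert cpu_buf.is_pinned(), "pinned memory is required for truly async copies"
    with torch.cuda.stream(offload_stream):
        cpu_buf.copy_(gpu_blocks, non_blocking=True)
        event = torch.cuda.Event()
        event.record(offload_stream)              # marks the point after the copy
    pending.append((event, req_id))

def poll_finished() -> list[str]:
    """Return request IDs whose copies have completed (non-blocking)."""
    done, still_pending = [], []
    for event, req_id in pending:
        (done if event.query() else still_pending).append((event, req_id))
    pending[:] = still_pending
    return [req_id for _, req_id in done]
```

The pinned (page-locked) CPU buffer is what makes `non_blocking=True` actually asynchronous; the connector could then call something like `poll_finished()` from its regular per-step worker hook instead of joining a thread.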
@ApostaC Regarding performance: adding some logging, I see that the times reported by the offloading connector are completely dominated by the running time of the CUDA transfer function. Two other thoughts: …
@njhill I moved all logging to debug.
@orozery Right, I know what you mean. While it might not be the best approach, what we've done in other cases is expose env vars to enable control of specific types of logs. So perhaps we could do this when a …
I prefer not to introduce a new env variable.
It could also be an option in the KV-connector-specific config, I guess; sounds good to defer, though.
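As a rough illustration of the env-var idea discussed above (the variable name `VLLM_KV_OFFLOAD_LOG` and the helper are hypothetical, not an existing vLLM setting), the gating could look like:

```python
# Hypothetical sketch: keep offload transfer logs at DEBUG by default (as in
# this PR), but promote them to INFO when an opt-in env var is set.
import logging
import os

logger = logging.getLogger(__name__)
_VERBOSE_OFFLOAD_LOGS = os.getenv("VLLM_KV_OFFLOAD_LOG", "0") == "1"

def log_transfer(msg: str, *args) -> None:
    if _VERBOSE_OFFLOAD_LOGS:
        logger.info(msg, *args)
    else:
        logger.debug(msg, *args)
```

A KV-connector-specific config option, as suggested above, would work the same way with the flag read from the connector config instead of the environment.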
This commit introduces a new OffloadingConnector for offloading blocks of KV data via a generic interface. Signed-off-by: Or Ozeri <oro@il.ibm.com>
LGTM!
```python
self._request_block_ids.pop(req_id, None)
self._next_stored_block_idx.pop(req_id, None)

request_being_stored = req_id in self._reqs_being_stored
```
There is a lag in determining a request's saved/sent state. During local debugging, I found that when a request finishes and its save operation has also completed, this method cannot immediately determine that the request has finished. As a result, the blocks the scheduler holds for this request in the current cycle are not released immediately; they are only released in the next cycle via KVConnectorOutput's finished_sending.
I think this is actually expected. There would be a memory leak if get_finished returned the request ID earlier than the scheduler calls request_finished.
This is because the scheduler expects to receive the finished request ID, and free the GPU KV cache, only after it has called request_finished. If the request ID is returned earlier, the GPU KV cache will never be freed.
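To make the ordering constraint concrete, here is a simplified, self-contained sketch (not the connector's actual code; `_FinishedTracker` and its fields are made up, and it assumes the worker-side get_finished hook is handed the request IDs the scheduler has already finished): a request is only reported back once both its async save has completed and the scheduler has declared it finished.

```python
# Sketch of the worker-side bookkeeping implied by the discussion above.
class _FinishedTracker:
    def __init__(self) -> None:
        self._saves_done: set[str] = set()          # async stores that completed
        self._scheduler_finished: set[str] = set()  # req ids the scheduler finished

    def on_save_done(self, req_id: str) -> None:
        self._saves_done.add(req_id)

    def get_finished(self, finished_req_ids: set[str]) -> set[str]:
        # Remember every request the scheduler has finished so far.
        self._scheduler_finished |= finished_req_ids
        # Report only requests satisfying both conditions; reporting earlier
        # would mean the scheduler never frees their GPU KV blocks.
        ready = self._saves_done & self._scheduler_finished
        self._saves_done -= ready
        self._scheduler_finished -= ready
        return ready
```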
You can take a look at my attempt in PR #27248. If I switch the connector to the Offloading Connector, the reset_prefix_cache method fails to take effect because the KV blocks from the first request have not been released yet; the resources from the first request are only freed after the second request arrives and executes once.
```
Processed prompts: 100%|███| 1/1 [00:00<00:00, 2.19it/s, est. speed input: 13182.61 toks/s, output: 21.95 toks/s]
--------------------------------------------------
Generated text: '...Hello, my name is...Hello, my'
Generation took 0.48 seconds, first request done.
--------------------------------------------------
WARNING 10-26 18:11:40 [v1/core/block_pool.py:366] Failed to reset prefix cache because some blocks (376) are not freed yet
Adding requests: 100%|██████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.85it/s]
Processed prompts:   0%| | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]DEBUG 10-26 18:11:40 [v1/core/sched/scheduler.py:1291] Finished sending KV transfer for request 0
Processed prompts: 100%|███| 1/1 [00:00<00:00, 5.13it/s, est. speed input: 30844.40 toks/s, output: 51.35 toks/s]
--------------------------------------------------
Generated text: ' about a person who is a doctor. I need'
Generation took 0.21 seconds, second request done.
--------------------------------------------------
```
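For context, a rough sketch of the kind of repro that produces the log above (the actual test lives in PR #27248; the model name, connector name, and `KVTransferConfig` fields shown here are assumptions and not verified against that PR):

```python
# Hypothetical repro sketch; configuration values are placeholders.
from vllm import LLM, SamplingParams
from vllm.config import KVTransferConfig

llm = LLM(
    model="facebook/opt-125m",  # placeholder model
    kv_transfer_config=KVTransferConfig(
        kv_connector="OffloadingConnector",  # assumed registration name for this connector
        kv_role="kv_both",
    ),
)
params = SamplingParams(max_tokens=10)

# First request: its blocks stay referenced until the connector reports
# finished_sending, so the reset below is expected to warn.
llm.generate(["Hello, my name is"], params)
llm.reset_prefix_cache()  # -> "Failed to reset prefix cache because some blocks ... are not freed yet"

# Second request: running another step delivers finished_sending, after which
# the first request's blocks are finally released.
llm.generate(["Tell me a story"], params)
```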
This PR adds an offloading connector that delegates to a generic API introduced in #19848.
The actual implementation of this API is built using a factory, which is currently empty.
A small follow-up PR will register a CPU implementation based on #20075 (scheduler-side implementation) and #21448 (worker-side implementation).
Part of RFC #19854.
Depends on PRs #19728, #19848, #19737.
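Since the generic offloading API is wired up through a factory that starts out empty, here is a generic sketch of that registration pattern; the class and function names are hypothetical, not the actual ones from #19848, #20075, or #21448.

```python
# Illustrative factory pattern only; names do not match the real API.
from typing import Callable

class OffloadingBackend:
    """Interface the connector delegates to (illustrative)."""
    def store(self, block_hashes: list[int]) -> None: ...
    def load(self, block_hashes: list[int]) -> None: ...

_BACKENDS: dict[str, Callable[[], OffloadingBackend]] = {}  # starts out empty

def register_backend(name: str, factory: Callable[[], OffloadingBackend]) -> None:
    _BACKENDS[name] = factory

def create_backend(name: str) -> OffloadingBackend:
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"unknown offloading backend: {name!r}") from None

# A follow-up PR would then register something like a CPU backend:
# register_backend("cpu", CPUOffloadingBackend)
```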