[P/D] Avoid stranding blocks in P when aborted in D's waiting queue #19223
Conversation
Hello @njhill, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
Hello! Gemini here, ready to provide a summary of this pull request. This PR addresses a specific issue in the nixl P/D (Prefill/Decode) disaggregated implementation where allocated blocks on the Prefill (P) worker could become stranded and not freed correctly. This happens when a request is aborted on the Decode (D) worker before the block allocation and transfer from P to D has completed. Specifically, this change targets the scenario where the abort occurs while the request is in the D instance's waiting queue. By ensuring that the D instance triggers the necessary cleanup path even in this early abort state, the P instance is correctly notified to free the blocks, preventing the stranding issue for this particular case. The PR description notes that there is still a remaining exposure between the P instance returning KV transfer parameters and the D instance receiving the request, which will be addressed in future work. This PR also depends on changes in pull request #18485.
Highlights
- Fix Block Stranding: Resolves an issue in the P/D disaggregated setup where blocks allocated on the Prefill (P) worker could be stranded if a request is aborted on the Decode (D) worker before block transfer.
- Handle Aborts in D's Waiting Queue: Specifically targets the scenario where a request is aborted while it is in the Decode (D) instance's waiting queue, ensuring proper cleanup notification to the P instance.
- Trigger Worker-Side Cleanup: Modifies the `request_finished` logic on the D side to add aborted requests (in the pre-transfer state) to the `_reqs_need_recv` list, triggering the worker's mechanism to notify the P instance to free blocks.
Changelog
- vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
  - Added logic within the `request_finished` method to detect when a D-side request with `do_remote_prefill` is aborted before block transfer (around lines 292-303).
  - For such aborted requests, the request is added to `_reqs_need_recv` with an empty block list, ensuring the worker-side cleanup logic is triggered to notify the P instance to free blocks.
  - Adjusted conditional checks to accommodate the new abort handling logic.
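To make the changelog concrete, here is a condensed, illustrative sketch of the D-side flow described above. It is assembled from the snippets quoted later in this thread; the standalone-function form, parameter names, and the placement of the early `if not params:` check are assumptions for illustration, not the verbatim merged code:

```python
from typing import Any, Optional


def request_finished_sketch(
    reqs_need_recv: dict[str, tuple[Any, list[int]]],
    request_id: str,
    request: Any,
    params: Optional[dict[str, Any]],
) -> tuple[bool, Optional[dict[str, Any]]]:
    """Condensed sketch of the D-side request_finished handling."""
    # Early-out for requests without KV transfer params (the robustness
    # check noted in the review below).
    if not params:
        return False, None

    if params.get("do_remote_prefill"):
        # The request finished (aborted) before local blocks were allocated
        # and the KV cache transferred from the prefill instance. Queue it
        # with an empty block list so the worker side notifies P to free
        # the blocks it allocated.
        reqs_need_recv[request_id] = (request, [])
        params["do_remote_prefill"] = False

    # No blocks to delay-free on this side, and no params to return.
    return False, None
```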
Code Review
This pull request effectively addresses a potential block stranding issue in the P/D disaggregated KV cache implementation. The core logic change in `request_finished` to handle scenarios where a request with `do_remote_prefill=True` finishes on the D-side before the prefill is processed appears sound. By queuing the request to `_reqs_need_recv` with empty block IDs, it correctly signals the P-instance to clean up its allocated blocks.
The addition of an early `if not params:` check also improves robustness.
Overall, the change is clear, concise, and seems to target the described problem well. My main suggestion revolves around aligning the inline comment with the implemented logic's generality.
Summary of Findings
- Clarity of intent for prefill cleanup logic: The implemented code for cleaning up P-side prefill blocks is general (it triggers if `do_remote_prefill` is true and the request finishes for any reason on the D-side). However, the associated comment and PR description specifically mention 'aborted' requests. It would be beneficial to align the comment with the code's general behavior or, if the intent is strictly for aborts, to make the code condition more specific.
Merge Readiness
The pull request introduces a valuable fix for potential block stranding. The core logic appears sound. There is one medium-severity comment regarding the alignment of an inline comment with the implemented logic's generality, which would be good to address for clarity. Assuming the dependent PR (#18485) correctly processes the `_reqs_need_recv` queue, this change should work as intended. I am unable to approve pull requests, so please ensure further review and approval from other maintainers before merging.
if params.get("do_remote_prefill"):
    # Decode-side request has been aborted before we've allocated
    # blocks locally and transferred from or notified the prefill
    # instance.
    # Add empty block_ids to _reqs_need_recv so that our worker side
    # will notify and free blocks in the prefill instance.
    self._reqs_need_recv[request.request_id] = (request, [])
    params["do_remote_prefill"] = False
    return False, None
The logic within this `if params.get("do_remote_prefill"):` block seems robust for handling cases where P-side blocks might be stranded. If `do_remote_prefill` is true when `request_finished` is called on the D-side, it implies that the P-instance has allocated blocks for prefill, but the D-instance has not yet completed its part of processing these blocks (e.g., allocating its own blocks and transferring the KV cache from P).
In such a state, if the request finishes on the D-side for any reason (aborted, stopped, length_capped, etc.), the P-side blocks would indeed be stranded if not explicitly cleaned up. This code correctly triggers that cleanup by adding the request to `_reqs_need_recv`.
The current implementation handles this generally, without explicitly checking `request.status == RequestStatus.FINISHED_ABORTED`. However, the inline comment (lines 296-298) and the PR description state that this change is specifically for when the "Decode-side request has been aborted".
Could you clarify whether the intent is for this block to execute only for aborted requests, or whether the current, more general behavior (triggering for any finish reason if `do_remote_prefill` is true) is intended for broader robustness against stranded blocks?
- If the current general behavior is intended (which seems more robust for preventing stranded blocks), perhaps the comment could be updated to reflect this generality? For example: "Decode-side request has finished before prefill blocks were processed locally..."
- If this logic is strictly meant for `FINISHED_ABORTED` cases as per the PR description, would it be clearer to add `and request.status == RequestStatus.FINISHED_ABORTED` to the condition on line 295? (A sketch of this stricter variant follows below.)
My current understanding is that the general approach (as implemented) is likely correct and more robust, in which case only the comment might need a slight adjustment for clarity.
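For concreteness, here is a minimal sketch of that stricter variant as it would look in the quoted block. This is the reviewer's hypothetical, not the merged code; it assumes `RequestStatus` is available in the connector module as it is elsewhere in vLLM's v1 code:

```python
# Hypothetical stricter condition: only trigger the P-side cleanup for
# requests that were actually aborted on the D side.
if (params.get("do_remote_prefill")
        and request.status == RequestStatus.FINISHED_ABORTED):
    # Queue with an empty block list so the worker notifies P to free blocks.
    self._reqs_need_recv[request.request_id] = (request, [])
    params["do_remote_prefill"] = False
    return False, None
```

As the maintainer's reply below explains, the extra status check turns out to be unnecessary: an abort is the only way a request can finish with `do_remote_prefill` still set.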
Well the abort case is the only way that this would be possible. If the request finishes for any other reason, it must have at least been scheduled and performed a forward pass, which means we will already have added it to the connector metadata so that the worker side will send the free or transfer nixl notification to the P worker.
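As a companion to this explanation, here is a rough, illustrative sketch of the scheduler-to-worker handoff that the dependent PR #18485 is said to provide. All names here (`ConnectorMetaSketch`, `drain_reqs_need_recv`) are invented stand-ins, not the actual vLLM API:

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ConnectorMetaSketch:
    """Stand-in for the per-step connector metadata (illustrative only)."""

    # request_id -> (local block ids to receive into, KV transfer params)
    reqs_to_recv: dict[str, tuple[list[int], dict[str, Any]]] = field(
        default_factory=dict)


def drain_reqs_need_recv(
    reqs_need_recv: dict[str, tuple[Any, list[int]]],
) -> ConnectorMetaSketch:
    """Move queued receive-requests into metadata handed to the worker side.

    An empty block list (the abort case handled by this PR) means there is
    nothing to read: the worker just sends the nixl notification so the
    prefill instance frees the blocks it allocated for the request.
    """
    meta = ConnectorMetaSketch()
    for req_id, (req, block_ids) in reqs_need_recv.items():
        meta.reqs_to_recv[req_id] = (block_ids, req.kv_transfer_params)
    reqs_need_recv.clear()
    return meta
```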
will the bot respond too?
This is a very elegant way to detect and handle the disconnect, nice job!
We currently have an exposure in the nixl P/D disagg implementation whereby allocated blocks can easily be stranded on the P worker when requests are aborted.
Whether things are cleaned up properly depends on where the request is in its P/D lifecycle when the abort (client disconnect) occurs.
The request's lifecycle runs roughly as follows: prefill completes on P; P returns the KV transfer parameters; the request lands in D's waiting queue; and finally D schedules it, allocates local blocks, and transfers the KV cache from P. Things are OK if the abort happens before P returns the transfer parameters or after the D-side transfer completes. The biggest exposure is while the request sits in D's waiting queue, which is what this PR addresses (from that point onwards the flow is now watertight). We still have an exposure between P returning the KV transfer parameters and D receiving the request, which will be addressed in subsequent PRs. (See the stage sketch below.)
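Expressed as an illustrative Python enum (the stage names are ours, chosen to match the description above, not identifiers from the codebase):

```python
from enum import Enum, auto


class PDLifecycleStage(Enum):
    """Illustrative stage numbering for the lifecycle described above."""

    PREFILL_RUNNING_ON_P = auto()    # P allocates blocks and runs prefill
    P_RETURNED_XFER_PARAMS = auto()  # P has returned the KV transfer params
    WAITING_IN_D_QUEUE = auto()      # request sits in D's waiting queue
    D_TRANSFER_DONE = auto()         # D allocated blocks and read KV from P


# Abort before P_RETURNED_XFER_PARAMS or after D_TRANSFER_DONE: already OK.
# Abort during WAITING_IN_D_QUEUE: fixed by this PR.
# Abort between P_RETURNED_XFER_PARAMS and WAITING_IN_D_QUEUE: remaining
# exposure, to be addressed in follow-up PRs.
```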
Changes
In `request_finished` on the D side, a request that finishes with `do_remote_prefill` still set is added to `_reqs_need_recv` with an empty block list, so the worker side notifies the P instance to free the blocks it allocated (see the summary and diff above).
Note that this PR depends on #18485 to work properly.