
[P/D] Avoid stranding blocks in P when aborted in D's waiting queue #19223


Merged · 3 commits into vllm-project:main · Jun 25, 2025

Conversation

njhill (Member) commented on Jun 5, 2025:

We currently have an exposure in the nixl P/D disagg implementation whereby allocated blocks can easily be stranded on the P worker when requests are aborted.

Whether things are cleaned up properly depends on where the request is in its P/D lifecycle when the abort (client disconnect) occurs.

  1. P instance receives request and performs prefill
  2. P instance returns kv transfer params to router/proxy
  3. D instance receives request and it goes into waiting queue
  4. D instance allocates blocks for request and transfers from P instance
  5. P instance frees blocks after transfer, D instance continues to generate tokens

Things are OK if the abort happens before (2) or after (4). The biggest exposure is during (3), which is what this PR addresses (i.e., step (3) onwards is now watertight). We still have an exposure between (2) and (3), which will be addressed in subsequent PRs.

Changes

  • If a request is aborted in the D instance before it has been scheduled, ensure that the free-blocks D->P notification still happens on the worker side (a sketch follows the note below).

Note that this PR depends on #18485 to work properly.
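
To make the mechanism concrete, below is a minimal, hypothetical sketch of the worker-side handling this change relies on. The names used here (start_load_kv, reqs_to_recv, _send_free_notification, _read_blocks) are illustrative assumptions rather than the connector's actual API; the key idea is that an entry with an empty block list means "nothing to read locally, just notify P to free its blocks".

def start_load_kv(self, metadata):
    # Entries originate from the scheduler-side _reqs_need_recv map.
    for req_id, (req, local_block_ids) in metadata.reqs_to_recv.items():
        if not local_block_ids:
            # Request was aborted in D's waiting queue before any local
            # blocks were allocated: skip the read entirely and notify
            # the prefill instance so it can free the blocks it holds.
            self._send_free_notification(req_id)
            continue
        # Normal path: read the KV blocks from the remote P worker; the
        # transfer-complete notification releases P's blocks afterwards.
        self._read_blocks(req_id, local_block_ids)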

njhill added the bug (Something isn't working) label on Jun 5, 2025
gemini-code-assist bot (Contributor) left a comment:

Hello @njhill, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

Hello! Gemini here, ready to provide a summary of this pull request. This PR addresses a specific issue in the nixl P/D (Prefill/Decode) disaggregated implementation where allocated blocks on the Prefill (P) worker could become stranded and not freed correctly. This happens when a request is aborted on the Decode (D) worker before the block allocation and transfer from P to D has completed. Specifically, this change targets the scenario where the abort occurs while the request is in the D instance's waiting queue. By ensuring that the D instance triggers the necessary cleanup path even in this early abort state, the P instance is correctly notified to free the blocks, preventing the stranding issue for this particular case. The PR description notes that there is still a remaining exposure between the P instance returning KV transfer parameters and the D instance receiving the request, which will be addressed in future work. This PR also depends on changes in pull request #18485.

Highlights

  • Fix Block Stranding: Resolves an issue in the P/D disaggregated setup where blocks allocated on the Prefill (P) worker could be stranded if a request is aborted on the Decode (D) worker before block transfer.
  • Handle Aborts in D's Waiting Queue: Specifically targets the scenario where a request is aborted while it is in the Decode (D) instance's waiting queue, ensuring proper cleanup notification to the P instance.
  • Trigger Worker-Side Cleanup: Modifies the request_finished logic on the D side to add aborted requests (in the pre-transfer state) to the _reqs_need_recv list, triggering the worker's mechanism to notify the P instance to free blocks.

Changelog

  • vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py
    • Added logic within the request_finished method to detect when a D-side request with do_remote_prefill is aborted before block transfer (around lines 292-303).
    • For such aborted requests, the request is added to _reqs_need_recv with an empty block list, ensuring the worker-side cleanup logic is triggered to notify the P instance to free blocks.
    • Adjusted conditional checks to accommodate the new abort handling logic.


State held far away,
Aborted request takes flight,
Free the blocks now, please.


gemini-code-assist bot (Contributor) left a review:

Code Review

This pull request effectively addresses a potential block stranding issue in the P/D disaggregated KV cache implementation. The core logic change in request_finished to handle scenarios where a request with do_remote_prefill=True finishes on the D-side before the prefill is processed appears sound. By queuing the request to _reqs_need_recv with empty block IDs, it correctly signals the P-instance to clean up its allocated blocks.

The addition of an early if not params: check also improves robustness.

Overall, the change is clear, concise, and seems to target the described problem well. My main suggestion revolves around aligning the inline comment with the implemented logic's generality.
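
For context, that guard presumably sits at the top of request_finished; a hedged sketch (signature and attribute access inferred from the diff below, not copied from the file):

def request_finished(self, request, block_ids):
    params = request.kv_transfer_params
    if not params:
        # Request never carried P/D transfer params; nothing to do.
        return False, None
    # ... existing remote-prefill / remote-decode handling follows,
    # including the new abort-cleanup branch shown in the diff below.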

Summary of Findings

  • Clarity of intent for prefill cleanup logic: The implemented code for cleaning up P-side prefill blocks is general (triggers if do_remote_prefill is true and the request finishes for any reason on D-side). However, the associated comment and PR description specifically mention 'aborted' requests. It would be beneficial to align the comment with the code's general behavior or, if the intent is strictly for aborts, to make the code condition more specific.

Merge Readiness

The pull request introduces a valuable fix for potential block stranding. The core logic appears sound. There is one medium-severity comment regarding the alignment of an inline comment with the implemented logic's generality, which would be good to address for clarity. Assuming the dependent PR (#18485) correctly processes the _reqs_need_recv queue, this change should work as intended. I am unable to approve pull requests, so please ensure further review and approval from other maintainers before merging.

Comment on lines 295 to 303
if params.get("do_remote_prefill"):
    # Decode-side request has been aborted before we've allocated
    # blocks locally and transferred from or notified the prefill
    # instance.
    # Add empty block_ids to _reqs_need_recv so that our worker side
    # will notify and free blocks in the prefill instance.
    self._reqs_need_recv[request.request_id] = (request, [])
    params["do_remote_prefill"] = False
    return False, None
gemini-code-assist bot (Contributor) commented (severity: medium):

The logic within this if params.get("do_remote_prefill"): block seems robust for handling cases where P-side blocks might be stranded. If do_remote_prefill is true when request_finished is called on the D-side, it implies that the P-instance has allocated blocks for prefill, but the D-instance has not yet completed its part of processing these blocks (e.g., allocating its own blocks and transferring the KV cache from P).

In such a state, if the request finishes on the D-side for any reason (aborted, stopped, length_capped, etc.), the P-side blocks would indeed be stranded if not explicitly cleaned up. This code correctly triggers that cleanup by adding the request to _reqs_need_recv.

The current implementation handles this generally, without explicitly checking if request.status == RequestStatus.FINISHED_ABORTED.

However, the inline comment (lines 296-298) and the PR description state that this change is specifically for when the "Decode-side request has been aborted".

Could you clarify if the intent is for this block to only execute for aborted requests, or if the current more general behavior (triggering for any finish reason if do_remote_prefill is true) is intended for broader robustness against stranded blocks?

  1. If the current general behavior is intended (which seems more robust for preventing stranded blocks), perhaps the comment could be updated to reflect this generality? For example: "Decode-side request has finished before prefill blocks were processed locally..."
  2. If this logic is strictly meant for FINISHED_ABORTED cases as per the PR description, would it be clearer to add and request.status == RequestStatus.FINISHED_ABORTED to the condition on line 295 (see the sketch below)?

My current understanding is that the general approach (as implemented) is likely correct and more robust, in which case only the comment might need a slight adjustment for clarity.
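
Concretely, the stricter variant in option (2) would look something like this (illustrative only, adapting the snippet above with the status check the comment names):

if params.get("do_remote_prefill") and (
        request.status == RequestStatus.FINISHED_ABORTED):
    # Limit the P-side cleanup branch to explicit aborts.
    self._reqs_need_recv[request.request_id] = (request, [])
    params["do_remote_prefill"] = False
    return False, None

As the author notes in the reply below, the two forms are effectively equivalent here, since an abort is the only way a request can finish while still unscheduled in the waiting queue.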

njhill (Member, Author) replied:

Well, the abort case is the only way that this would be possible. If the request finishes for any other reason, it must have at least been scheduled and performed a forward pass, which means we will already have added it to the connector metadata so that the worker side will send the free or transfer nixl notification to the P worker.

A contributor asked:
will the bot respond too?


github-actions bot commented Jun 5, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

NickLucche (Contributor) left a comment:

This is a very elegant way to detect and handle the disconnect, nice job!


njhill added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jun 25, 2025
njhill merged commit 55c65ab into vllm-project:main on Jun 25, 2025
78 checks passed
njhill deleted the fix-pd-waiting-abort branch on June 25, 2025 at 22:19
m-misiura pushed a commit to m-misiura/vllm that referenced this pull request Jun 26, 2025
gmarinho2 pushed a commit to gmarinho2/vllm that referenced this pull request Jun 26, 2025