[BugFix] resource_manager_v1 lock PD #5616

Merged
Jiang-Jia-Jun merged 12 commits into PaddlePaddle:develop from ST-XX:feature/pd
Jan 8, 2026
ST-XX (Collaborator) commented Dec 17, 2025

Motivation

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)

Modifications

In PD disaggregation mode, some related functions in resource_manager_v1 were missing lock protection.

Usage or Command

PD

Accuracy Tests

Functionality tested and working as expected:
bash -x start_v0_tp1.sh
bash start_v1_tp1_vl.sh

Checklist

  • Add at least a tag in the PR title.
    • Tag list: [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code, run pre-commit before commit.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

paddle-bot Bot commented Dec 17, 2025

Thanks for your contribution!

@paddle-bot paddle-bot Bot added the contributor External developers label Dec 17, 2025
@ST-XX ST-XX requested a review from juncaipeng December 17, 2025 08:57
codecov-commenter commented Dec 17, 2025

Codecov Report

❌ Patch coverage is 0% with 20 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@78adf83). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| fastdeploy/engine/sched/resource_manager_v1.py | 0.00% | 20 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #5616   +/-   ##
==========================================
  Coverage           ?   66.99%           
==========================================
  Files              ?      347           
  Lines              ?    44454           
  Branches           ?     6831           
==========================================
  Hits               ?    29781           
  Misses             ?    12470           
  Partials           ?     2203           
| Flag | Coverage Δ |
| --- | --- |
| GPU | 66.99% <0.00%> (?) |

juncaipeng (Collaborator) previously approved these changes Dec 17, 2025

LGTM

Jiang-Jia-Jun previously approved these changes Dec 26, 2025

Copilot AI (Contributor) left a comment

Pull request overview

This PR addresses a critical thread-safety issue in the resource_manager_v1 module for PD (Prefill-Decode) disaggregation mode. The fix adds missing lock protection to two functions that manage prefilled requests in the decode instance.

Key changes:

  • Added lock protection to has_resource_for_prefilled_req() method
  • Added lock protection to add_prefilled_request() method with proper early return handling

Comment on lines 1120 to 1127
self.lock.acquire()
assert self.config.scheduler_config.splitwise_role == "decode", "Only D instance can call this method"
if request_output.request_id not in self.requests:
    llm_logger.error(f"Request {request_output.request_id} not found in requests")
    self.lock.release()
    return
request = self.requests[request_output.request_id]

Copilot AI Dec 26, 2025

The explicit lock.acquire()/lock.release() pattern is error-prone. If an exception occurs between lines 1120-1142 (before the lock.release() at line 1142), the lock remains held indefinitely and every later caller deadlocks. Use the context manager pattern (with self.lock:) instead, as has_resource_for_prefilled_req() already does, so the lock is always released, even when an exception is raised.
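
For illustration, here is a minimal sketch of the difference between the two locking patterns. It is not the FastDeploy code: the class `PrefilledRequestTable`, its methods `add_manual` and `add_with`, and the helper `lock_is_free` are hypothetical names, and it assumes the lock is a plain `threading.Lock` (the actual lock type in resource_manager_v1 is not shown in this thread).

```python
import threading


class PrefilledRequestTable:
    """Hypothetical stand-in for the resource manager; not FastDeploy code."""

    def __init__(self):
        self.lock = threading.Lock()  # assumption: a non-reentrant lock
        self.requests = {}

    def add_manual(self, request_id):
        # Manual pattern: every exit path has to remember to release.
        self.lock.acquire()
        request = self.requests[request_id]  # a KeyError here skips the release
        request["prefilled"] = True
        self.lock.release()

    def add_with(self, request_id):
        # Context manager: the lock is released on return and on exception.
        with self.lock:
            if request_id not in self.requests:
                return False  # early return still releases the lock
            self.requests[request_id]["prefilled"] = True
            return True


def lock_is_free(lock):
    """Return True if the lock can be taken right now, then give it back."""
    if lock.acquire(blocking=False):
        lock.release()
        return True
    return False


# Manual pattern: the exception leaves the lock held.
manual = PrefilledRequestTable()
try:
    manual.add_manual("missing")
except KeyError:
    pass
leaked = not lock_is_free(manual.lock)

# Context manager: both the early return and the normal path release it.
safe = PrefilledRequestTable()
safe.requests["r1"] = {"prefilled": False}
found_missing = safe.add_with("missing")
found_r1 = safe.add_with("r1")
released = lock_is_free(safe.lock)
```

With the `with` form, the early-return branch no longer needs its own release call, which removes the class of bug described in the comment above.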

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit d8c6ba6 into PaddlePaddle:develop Jan 8, 2026
18 of 20 checks passed
chang-wenbin pushed a commit to chang-wenbin/FastDeploy that referenced this pull request Mar 2, 2026
* bugfix resource_manager_v1 lock PD

* with lock add_prefilled_request

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
xiaoguoguo626807 pushed a commit to xiaoguoguo626807/FastDeploy that referenced this pull request May 7, 2026
* bugfix resource_manager_v1 lock PD

* with lock add_prefilled_request

---------

Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>