[BugFix] resource_manager_v1 lock PD#5616
Conversation
|
Thanks for your contribution! |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5616 +/- ##
==========================================
Coverage ? 66.99%
==========================================
Files ? 347
Lines ? 44454
Branches ? 6831
==========================================
Hits ? 29781
Misses ? 12470
Partials ? 2203
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR addresses a critical thread-safety issue in the resource_manager_v1 module for PD (Prefill-Decode) disaggregation mode. The fix adds missing lock protection to two functions that manage prefilled requests in the decode instance.
Key changes:
- Added lock protection to
has_resource_for_prefilled_req()method - Added lock protection to
add_prefilled_request()method with proper early return handling
| self.lock.acquire() | ||
| assert self.config.scheduler_config.splitwise_role == "decode", "Only D instance can call this method" | ||
| if request_output.request_id not in self.requests: | ||
| llm_logger.error(f"Request {request_output.request_id} not found in requests") | ||
| self.lock.release() | ||
| return | ||
| request = self.requests[request_output.request_id] | ||
|
|
There was a problem hiding this comment.
Using explicit lock.acquire()/lock.release() pattern is error-prone. If an exception occurs between lines 1120-1142 (before the lock.release() at line 1142), the lock will remain held indefinitely, causing a deadlock. Use a context manager pattern (with self.lock:) instead, as done in has_resource_for_prefilled_req(), to ensure the lock is always released even in exception scenarios.
* bugfix resource_manager_v1 lock PD * with lock add_prefilled_request --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
* bugfix resource_manager_v1 lock PD * with lock add_prefilled_request --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com> Co-authored-by: Jiang-Jia-Jun <163579578+Jiang-Jia-Jun@users.noreply.github.com>
Motivation
Modifications
PD分离模式,resource_manager_v1 部分相关函数未添加锁
Usage or Command
PD
Accuracy Tests
测试功能正常
bash -x start_v0_tp1.sh
bash start_v1_tp1_vl.sh
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.