[Optim] Robust sync status when preempted happens #5796
Pull request overview
This PR fixes a synchronization issue that occurs during request preemption by introducing a dedicated PREEMPTED_TOKEN_ID (-9) to distinguish preemption synchronization signals from other uses of token_id -1. Previously, -1 was ambiguously used for both invalid slots and completed block allocations, which could cause synchronization problems when the token processor processes tokens slower than the engine generates them.
Key changes:
- Introduces `PREEMPTED_TOKEN_ID = -9` constant to explicitly signal preemption completion
- Adds `preempted_idx` list tracking in all model runners (GPU, XPU, Metax) to record which slots have been preempted
- Updates token processor logic to specifically check for `PREEMPTED_TOKEN_ID` before rescheduling preempted requests
- Adds guard in `cache_output_tokens` to only cache during decoding phase, preventing issues when requests are preempted during prefill
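The token-processor side of these changes can be sketched roughly as follows. This is a minimal illustration, not the code in `fastdeploy/output/token_processor.py`: only `PREEMPTED_TOKEN_ID = -9` and the `< 0` comparison come from the PR, while `SlotState`, `process_slot_token`, and `waiting_queue` are hypothetical names introduced here.

```python
from dataclasses import dataclass, field

PREEMPTED_TOKEN_ID = -9  # value taken from the PR; everything else is illustrative


@dataclass
class SlotState:
    """Hypothetical per-slot bookkeeping kept by the service layer."""

    request_id: str
    output_tokens: list = field(default_factory=list)
    preempt_requested: bool = False


def process_slot_token(slot: SlotState, token_id: int, waiting_queue: list) -> None:
    """Handle one token id that the engine emitted for this slot."""
    if token_id == PREEMPTED_TOKEN_ID:
        # The engine has finished handling the preemption, so every valid
        # token for this slot has already been received. Only now is it
        # safe to hand the request back to the scheduler.
        if slot.preempt_requested:
            waiting_queue.append(slot.request_id)
            slot.preempt_requested = False
        return
    if token_id < 0:
        # Any other negative id (for example -1) is not a real token and
        # is no longer treated as a preemption signal.
        return
    # Token id 0 is kept, matching the change from `<= 0` to `< 0`.
    slot.output_tokens.append(token_id)


slot = SlotState("req-0", preempt_requested=True)
waiting_queue = []
for token_id in [11, 0, -1, 42, PREEMPTED_TOKEN_ID]:
    process_slot_token(slot, token_id, waiting_queue)
```

In this sketch a stray `-1` in the middle of the stream is ignored, and the request is requeued only once the dedicated sentinel arrives.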
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| fastdeploy/config.py | Defines new PREEMPTED_TOKEN_ID = -9 constant for unambiguous preemption signaling |
| fastdeploy/worker/xpu_model_runner.py | Initializes and tracks preempted indices in share_inputs for XPU model runner |
| fastdeploy/worker/metax_model_runner.py | Initializes and tracks preempted indices in share_inputs for Metax model runner |
| fastdeploy/worker/gpu_model_runner.py | Initializes and tracks preempted indices in share_inputs for GPU model runner |
| fastdeploy/output/token_processor.py | Updates reschedule logic to check for specific PREEMPTED_TOKEN_ID; removes old batch-based reschedule method; adjusts token_id comparison from <= 0 to < 0 |
| fastdeploy/model_executor/pre_and_post_process.py | Sets PREEMPTED_TOKEN_ID for preempted slots after token generation and input updates |
| fastdeploy/engine/sched/resource_manager_v1.py | Adds condition to only cache output tokens during decoding phase, preventing issues during preemption in prefill |
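On the engine side, the combination of `preempted_idx` tracking and the post-processing step might look something like the sketch below. It assumes plain Python lists instead of the tensors held in `share_inputs`, and the function name `mark_preempted_slots` as well as the decision to clear the list afterwards are assumptions made for illustration, not details confirmed by the PR.

```python
PREEMPTED_TOKEN_ID = -9  # value taken from the PR


def mark_preempted_slots(sampled_token_ids: list, preempted_idx: list) -> list:
    """Overwrite the output of preempted slots with the sentinel.

    `sampled_token_ids` holds one token id per batch slot for the current
    step; `preempted_idx` holds the slot indices preempted since the last
    step. Both are hypothetical stand-ins for the real runner state.
    """
    for idx in preempted_idx:
        sampled_token_ids[idx] = PREEMPTED_TOKEN_ID
    # Assumption: each preemption is signalled exactly once, so the
    # pending list is emptied after the sentinels are written.
    preempted_idx.clear()
    return sampled_token_ids


step_tokens = [101, -1, 205, 318]  # slot 1 is an unused slot in this example
pending = [2]                      # slot 2 was preempted
mark_preempted_slots(step_tokens, pending)
```

After the call, slot 2 carries `-9` while the `-1` in slot 1 is left untouched, so the two conditions stay distinguishable downstream.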
Codecov Report
❌ Patch coverage is …

Coverage diff for develop, #5796:

| Metric | Value |
|---|---|
| Coverage | 67.30% |
| Files | 348 |
| Lines | 44769 |
| Branches | 6891 |
| Hits | 30134 |
| Misses | 12409 |
| Partials | 2226 |
* [Bug fix] Sync status for caching output cache * fix * fix * fix bug * fix * fix * support xpu * fix * fix * fix * fix * fix * fix ci * fix ci * fix xpu --------- Co-authored-by: YuBaoku <49938469+EmmonsCurse@users.noreply.github.com>
Motivation
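Background
When a preemption occurs, the service layer must synchronize with the engine on the preempted slot, making sure that every token the engine generated for that slot has been received, before the request can be moved back to the waiting queue for rescheduling.
If the service layer preempts a request without synchronizing with the engine, it may see N tokens that need prefill when it schedules the preempted request back to prefill, then asynchronously receive one more token before the request is sent to the engine, so the engine ends up seeing N+1 tokens that need prefill. This mismatch causes the request in that slot to hang.
To solve this, the service layer's token processor currently relies on token_id -1 for synchronization: after emitting all valid tokens, the engine returns -1 when it handles the preempted task, indicating that every valid token previously generated for that slot has been received. Only then is the preempted request put back into the waiting queue for rescheduling.
In practice, however, a generated -1 can mean two different things.
When -1 is used for preemption synchronization, in some scenarios it is impossible to tell whether the -1 came from case 1 or case 2, which leads to synchronization problems.
To improve robustness and remove the ambiguity, a dedicated token_id is used to synchronize with the engine's handling of preemption.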
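The N versus N+1 mismatch described above can be reproduced with a toy stream. The sketch below is an illustration under assumed data: the token streams and the helper `tokens_seen_at_reschedule` are hypothetical, and only the sentinel value `-9` comes from this PR.

```python
PREEMPTED_TOKEN_ID = -9  # value taken from the PR


def tokens_seen_at_reschedule(stream, sync_marker):
    """Count the valid tokens received before `sync_marker` first appears.

    Returns None when the marker never arrives, meaning the request is
    never handed back to the waiting queue.
    """
    received = 0
    for token_id in stream:
        if token_id == sync_marker:
            return received
        if token_id >= 0:
            received += 1
    return None


# Hypothetical output of one slot. The engine produced three valid tokens,
# and a -1 unrelated to preemption shows up before the last one.
engine_generated = 3
old_stream = [17, 23, -1, 42, -1]                  # -1 doubles as the sync marker
new_stream = [17, 23, -1, 42, PREEMPTED_TOKEN_ID]  # dedicated sync marker

seen_with_minus_one = tokens_seen_at_reschedule(old_stream, -1)
seen_with_sentinel = tokens_seen_at_reschedule(new_stream, PREEMPTED_TOKEN_ID)
```

With `-1` as the marker the service layer reschedules after two tokens while the engine already holds three, which is the mismatch that leaves the slot hanging. With the dedicated marker both sides agree on three.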
Modifications
None
Usage or Command
None
Accuracy Tests
None
Checklist
- [[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
- … `pre-commit` before commit.
- … `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.