[Feature] [PD Disaggregation] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports#5415
Conversation
…factor post-init and usage for all ports
|
Thanks for your contribution! |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## develop #5415 +/- ##
==========================================
Coverage ? 62.08%
==========================================
Files ? 329
Lines ? 41287
Branches ? 6295
==========================================
Hits ? 25633
Misses ? 13701
Partials ? 1953
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR refactors port configuration for PD-disaggregated deployments by introducing automatic port allocation and consolidating port management logic. The changes aim to simplify configuration by removing the need for users to manually specify multiple ports per service.
Key Changes
- Introduced automatic port discovery via
find_free_ports()function with validation inpost_init_all_ports() - Centralized port extraction logic in
FDConfig.postprocess()to set per-DP ports (local_engine_worker_queue_port, etc.) - Added RDMA environment auto-detection via new
get_rdma_nics.shscript
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| fastdeploy/utils.py | Added parse_ports(), find_free_ports(), and modified is_port_available() for port management |
| fastdeploy/engine/args_utils.py | Added post_init_all_ports() to automatically allocate ports; changed default cache_transfer_protocol to "ipc,rdma" |
| fastdeploy/config.py | Moved port extraction logic to postprocess() method; added local_engine_worker_queue_port attribute |
| fastdeploy/worker/worker_process.py | Moved port assignment from pre-config to post-config initialization |
| fastdeploy/engine/engine.py | Updated to use local_engine_worker_queue_port for DP queue services |
| fastdeploy/engine/common_engine.py | Simplified port handling by using local_engine_worker_queue_port directly |
| fastdeploy/engine/expert_service.py | Removed local port/device ID extraction logic (now handled in config) |
| fastdeploy/splitwise/splitwise_connector.py | Changed to use scalar pd_comm_port and local_device_ids |
| fastdeploy/cache_manager/prefix_cache_manager.py | Renamed pid_suffix to ipc_suffix; updated RDMA port usage |
| fastdeploy/cache_manager/cache_transfer_manager.py | Renamed engine_pid to ipc_suffix for clarity |
| fastdeploy/cache_manager/cache_messager.py | Renamed engine_pid to ipc_suffix for consistency |
| fastdeploy/cache_manager/transfer_factory/rdma_cache_transfer.py | Added automatic RDMA environment setup and NIC detection |
| fastdeploy/cache_manager/transfer_factory/get_rdma_nics.sh | New script for detecting RDMA-capable network interfaces |
| examples/splitwise/utils.sh | Improved health check display with UDP port checking |
| examples/splitwise/start_v1_tp1.sh | Simplified example by removing manual port specifications |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 16 comments.
Comments suppressed due to low confidence (1)
fastdeploy/engine/args_utils.py:557
- This comment appears to contain commented-out code.
# for port in ports:
# assert is_port_available("0.0.0.0", port), f"Parameter `{name}`:{port} is already in use."
…ated deployment, and refactor post-init and usage for all ports (PaddlePaddle#5415) * [feat] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports * [fix] fix some bugs * [fix] fix rdma port for cache manager/messager * [fix] temporarily cancel port availability check to see if it can pass ci test * [feat] simplify args for multi api server * [fix] fix dp * [fix] fix port for xpu * [fix] add tests for ports post processing & fix ci * [test] fix test_multi_api_server * [fix] fix rdma_comm_ports args for multi_api_server * [fix] fix test_common_engine * [fix] fix test_cache_transfer_manager * [chore] automatically setting FD_ENABLE_MULTI_API_SERVER * [fix] avoid api server from creating engine_args twice * [fix] fix test_run_batch * [fix] fix test_metrics * [fix] fix splitwise connector init * [test] add test_rdma_transfer and test_expert_service * [fix] fix code syntax * [fix] fix test_rdma_transfer and build wheel with rdma script
…ated deployment, and refactor post-init and usage for all ports (PaddlePaddle#5415) * [feat] simplify configuration for pd-disaggregated deployment, and refactor post-init and usage for all ports * [fix] fix some bugs * [fix] fix rdma port for cache manager/messager * [fix] temporarily cancel port availability check to see if it can pass ci test * [feat] simplify args for multi api server * [fix] fix dp * [fix] fix port for xpu * [fix] add tests for ports post processing & fix ci * [test] fix test_multi_api_server * [fix] fix rdma_comm_ports args for multi_api_server * [fix] fix test_common_engine * [fix] fix test_cache_transfer_manager * [chore] automatically setting FD_ENABLE_MULTI_API_SERVER * [fix] avoid api server from creating engine_args twice * [fix] fix test_run_batch * [fix] fix test_metrics * [fix] fix splitwise connector init * [test] add test_rdma_transfer and test_expert_service * [fix] fix code syntax * [fix] fix test_rdma_transfer and build wheel with rdma script
Motivation
该 PR 旨在实现两个目标:
Modifications
Usage or Command
Accuracy Tests
Checklist
[FDConfig],[APIServer],[Engine],[Scheduler],[PD Disaggregation],[Executor],[Graph Optimization],[Speculative Decoding],[RL],[Models],[Quantization],[Loader],[OP],[KVCache],[DataProcessor],[BugFix],[Docs],[CI],[Optimization],[Feature],[Benchmark],[Others],[XPU],[HPU],[GCU],[DCU],[Iluvatar],[Metax]]pre-commitbefore commit.releasebranch, make sure the PR has been submitted to thedevelopbranch, then cherry-pick it to thereleasebranch with the[Cherry-Pick]PR tag.