fix: prevent port overflow in vLLM server with high data parallelism (fixes #652) #653
base: main
Conversation
Code Review
This pull request effectively resolves a critical port overflow issue in multi-node configurations by correctly calculating node-local server indices. The fix is concise and well-commented. The addition of a comprehensive suite of 20 unit tests is particularly commendable, as it thoroughly validates the port allocation logic across various scenarios, including the specific bug case, multi-node setups, and different parallelism configurations. My review includes a few minor suggestions to improve code style and consistency.
- Refactor multi-line assertions to if-raise pattern to resolve black/ruff conflicts
- Remove unused variables in test file (gpus_per_server, ports_per_server)
- Apply pre-commit formatting fixes (trailing whitespace, markdown formatting)
- All files now pass both black and ruff format checks
/gemini review
Code Review
This pull request effectively resolves a critical port overflow issue that occurred in multi-node environments with high data parallelism. The fix, which correctly calculates a node-local server index using the modulo operator, is sound and well-targeted. The addition of a comprehensive test suite is excellent, as it covers numerous scenarios and ensures the stability of the port allocation logic, preventing future regressions. I've included a few minor suggestions to improve code style in the implementation and tests, and to correct a small typo in the documentation.
Code Review
This pull request effectively resolves a critical port overflow issue in multi-node, high data parallelism scenarios by correctly calculating node-local server indices. The fix is simple, well-commented, and robustly supported by an extensive new test suite that covers the specific bug, various configurations, and edge cases. The addition of these tests is a significant improvement. I've included a couple of minor suggestions to improve test code style and fix a documentation typo. Overall, this is an excellent contribution.
@garrett4wade @xssstory Hi maintainers, could you please check the workflow approval? Thanks.
garrett4wade left a comment
Hi @HsiaoTsan, sorry for the late reply. LGTM but with a minor comment.
This new test may be unnecessary.
Description
Fixes issue: OverflowError in vLLM server port allocation when using high data parallelism (e.g., `allocation_mode=vllm:d12t1`). This update corrects incorrect port-range calculations that occurred in multi-node environments due to the use of global GPU indices instead of node-local indices. The resulting overflow produced port values above 65535.
This implementation includes:

- `server_idx_offset` computation using modulo to ensure node-local indexing
- Changes to `vllm_server.py` that are minimal and backward-compatible

The key difference from the prior behavior is that port ranges are now computed per-node rather than globally, eliminating invalid port assignments such as [65000–70000].

Related Issue
Fixes #652
Addresses overflow errors observed when running `allocation_mode=vllm:d12p1t1+d4p1t1` on 2×8-GPU clusters.

Type of Change
Implementation Details
Core Change
- Modified `areal/launcher/vllm_server.py`
- Applied `% n_servers_per_node` to `server_idx_offset` so that indices wrap correctly on multi-node setups

Example
Before (incorrect global indexing):
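A minimal sketch of the failing pattern, using assumed names (`PORT_RANGE_START`, `PORTS_PER_SERVER`, and `global_server_idx` are illustrative, not the actual identifiers in `vllm_server.py`):

```python
# Illustrative sketch only; constants and names are assumptions, not the
# actual code in areal/launcher/vllm_server.py.
PORT_RANGE_START = 10000
PORTS_PER_SERVER = 5000

global_server_idx = 11  # e.g. the 12th DP server under vllm:d12t1

# Global indexing: the offset keeps growing across nodes.
server_idx_offset = global_server_idx
port_lo = PORT_RANGE_START + server_idx_offset * PORTS_PER_SERVER
port_hi = port_lo + PORTS_PER_SERVER
print(port_lo, port_hi)  # 65000 70000 -> exceeds the maximum TCP port 65535
```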
Produced invalid port ranges such as `(65000, 70000)`.

After (node-local indexing):
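A matching sketch with the node-local index; of these names, only `server_idx_offset` and `n_servers_per_node` come from the PR description, and the rest are the same illustrative assumptions as above:

```python
# Illustrative sketch only; mirrors the modulo fix described in this PR.
PORT_RANGE_START = 10000
PORTS_PER_SERVER = 5000
n_servers_per_node = 8   # one DP server per GPU on an 8-GPU node
global_server_idx = 11   # 12th DP server under vllm:d12t1

# Node-local indexing: wrap the global index onto this node.
server_idx_offset = global_server_idx % n_servers_per_node  # 11 % 8 == 3
port_lo = PORT_RANGE_START + server_idx_offset * PORTS_PER_SERVER
port_hi = port_lo + PORTS_PER_SERVER
print(port_lo, port_hi)  # 25000 30000 -> within the valid 0-65535 range
```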
Ports stay within the valid 0–65535 range.
Tests
- Added `areal/tests/test_vllm_server_launcher.py` with a suite of unit tests for the port allocation logic
- Key case: `test_high_data_parallelism_d12_no_overflow`, reproducing the original bug

All tests pass.
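For illustration, a self-contained sketch of the kind of bound check the suite performs; `allocate_port_range` and its constants are hypothetical stand-ins rather than the project's actual API, and the if-raise assertion style follows the formatting commit above:

```python
# Hypothetical stand-in for the port allocation under test; not the real API.
PORT_RANGE_START = 10000
PORTS_PER_SERVER = 5000


def allocate_port_range(global_server_idx: int, n_servers_per_node: int) -> tuple[int, int]:
    # Node-local index, mirroring the fix described above.
    offset = global_server_idx % n_servers_per_node
    lo = PORT_RANGE_START + offset * PORTS_PER_SERVER
    return lo, lo + PORTS_PER_SERVER


def test_high_data_parallelism_d12_no_overflow():
    # 12 DP servers across two 8-GPU nodes must all stay within valid ports.
    for idx in range(12):
        lo, hi = allocate_port_range(idx, n_servers_per_node=8)
        if not (0 < lo < hi <= 65535):
            raise AssertionError(f"port range {(lo, hi)} exceeds 65535")
```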
Validated configurations:
- `vllm:d12t1` (previously failing)
- `vllm:d16t1`

Checklist
Breaking Change Details
Not applicable — this change is fully backward compatible.
Additional Context
This fix ensures stable port allocation for high data-parallel configurations in multi-node environments, resolving prior failures caused by exceeding valid port ranges.