fix: prevent port overflow in vLLM server with high data parallelism (fixes #652) #653
base: main
Conversation
Code Review
This pull request effectively resolves a critical port overflow issue in multi-node configurations by correctly calculating node-local server indices. The fix is concise and well-commented. The addition of a comprehensive suite of 20 unit tests is particularly commendable, as it thoroughly validates the port allocation logic across various scenarios, including the specific bug case, multi-node setups, and different parallelism configurations. My review includes a few minor suggestions to improve code style and consistency.
- Refactor multi-line assertions to if-raise pattern to resolve black/ruff conflicts
- Remove unused variables in test file (gpus_per_server, ports_per_server)
- Apply pre-commit formatting fixes (trailing whitespace, markdown formatting)
- All files now pass both black and ruff format checks
/gemini review
Code Review
This pull request effectively resolves a critical port overflow issue that occurred in multi-node environments with high data parallelism. The fix, which correctly calculates a node-local server index using the modulo operator, is sound and well-targeted. The addition of a comprehensive test suite is excellent, as it covers numerous scenarios and ensures the stability of the port allocation logic, preventing future regressions. I've included a few minor suggestions to improve code style in the implementation and tests, and to correct a small typo in the documentation.
Code Review
This pull request effectively resolves a critical port overflow issue in multi-node, high data parallelism scenarios by correctly calculating node-local server indices. The fix is simple, well-commented, and robustly supported by an extensive new test suite that covers the specific bug, various configurations, and edge cases. The addition of these tests is a significant improvement. I've included a couple of minor suggestions to improve test code style and fix a documentation typo. Overall, this is an excellent contribution.
@garrett4wade @xssstory Hi maintainers, could you please check the workflow approval? Thanks.
garrett4wade left a comment
Hi @HsiaoTsan, sorry for the late reply. LGTM but with a minor comment.
This new test may be unnecessary.
Description
Fixes issue: OverflowError in vLLM server port allocation when using high data parallelism (e.g., `allocation_mode=vllm:d12t1`). This update corrects incorrect port-range calculations that occurred in multi-node environments due to the use of global GPU indices instead of node-local indices. The resulting overflow produced port values above 65535.
This implementation includes:

- `server_idx_offset` computation using modulo to ensure node-local indexing
- Changes to `vllm_server.py` that are minimal and backward-compatible

The key difference from the prior behavior is that port ranges are now computed per-node rather than globally, eliminating invalid port assignments such as [65000–70000].

Related Issue
Fixes #652
Addresses overflow errors observed when running `allocation_mode=vllm:d12p1t1+d4p1t1` on 2×8-GPU clusters.

Type of Change
Implementation Details
Core Change
- Modified `areal/launcher/vllm_server.py`
- Applied `% n_servers_per_node` to `server_idx_offset` so that indices wrap correctly on multi-node setups

Example
Before (incorrect global indexing):
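A minimal sketch of the failing pattern, using assumed names (`PORT_RANGE_START`, `PORTS_PER_SERVER`, and `global_server_idx` are illustrative, not the actual identifiers in `vllm_server.py`):

```python
# Illustrative sketch only; constants and names are assumptions, not the
# actual code in areal/launcher/vllm_server.py.
PORT_RANGE_START = 10000
PORTS_PER_SERVER = 5000

global_server_idx = 11  # e.g. the 12th DP server under vllm:d12t1

# Global indexing: the offset keeps growing across nodes.
server_idx_offset = global_server_idx
port_lo = PORT_RANGE_START + server_idx_offset * PORTS_PER_SERVER
port_hi = port_lo + PORTS_PER_SERVER
print(port_lo, port_hi)  # 65000 70000 -> exceeds the maximum TCP port 65535
```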
Produced invalid port ranges such as `(65000, 70000)`.

After (node-local indexing):
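A matching sketch with the node-local index; of these names, only `server_idx_offset` and `n_servers_per_node` come from the PR description, and the rest are the same illustrative assumptions as above:

```python
# Illustrative sketch only; mirrors the modulo fix described in this PR.
PORT_RANGE_START = 10000
PORTS_PER_SERVER = 5000
n_servers_per_node = 8   # one DP server per GPU on an 8-GPU node
global_server_idx = 11   # 12th DP server under vllm:d12t1

# Node-local indexing: wrap the global index onto this node.
server_idx_offset = global_server_idx % n_servers_per_node  # 11 % 8 == 3
port_lo = PORT_RANGE_START + server_idx_offset * PORTS_PER_SERVER
port_hi = port_lo + PORTS_PER_SERVER
print(port_lo, port_hi)  # 25000 30000 -> within the valid 0-65535 range
```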
Ports stay within the valid 0–65535 range.
Tests
- Added `areal/tests/test_vllm_server_launcher.py` with a suite of unit tests for the port allocation logic
- Key case: `test_high_data_parallelism_d12_no_overflow`, reproducing the original bug

All tests pass.
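For illustration, a self-contained sketch of the kind of bound check the suite performs; `allocate_port_range` and its constants are hypothetical stand-ins rather than the project's actual API, and the if-raise assertion style follows the formatting commit above:

```python
# Hypothetical stand-in for the port allocation under test; not the real API.
PORT_RANGE_START = 10000
PORTS_PER_SERVER = 5000


def allocate_port_range(global_server_idx: int, n_servers_per_node: int) -> tuple[int, int]:
    # Node-local index, mirroring the fix described above.
    offset = global_server_idx % n_servers_per_node
    lo = PORT_RANGE_START + offset * PORTS_PER_SERVER
    return lo, lo + PORTS_PER_SERVER


def test_high_data_parallelism_d12_no_overflow():
    # 12 DP servers across two 8-GPU nodes must all stay within valid ports.
    for idx in range(12):
        lo, hi = allocate_port_range(idx, n_servers_per_node=8)
        if not (0 < lo < hi <= 65535):
            raise AssertionError(f"port range {(lo, hi)} exceeds 65535")
```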
Validated configurations:
- `vllm:d12t1` (previously failing)
- `vllm:d16t1`

Checklist
Breaking Change Details
Not applicable — this change is fully backward compatible.
Additional Context
This fix ensures stable port allocation for high data-parallel configurations in multi-node environments, resolving prior failures caused by exceeding valid port ranges.