Conversation

@evberrypi (Contributor) commented Dec 3, 2025

Purpose

This PR corrects the change from #28328 by removing the new block that built vLLM variables and instead adding the node IP address to the ray start command. This fixes issues when a device has multiple interfaces.
Use case: a device has both Ethernet and QSFP cabling, and we want Ray to serve over the high-speed link.
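The approach can be sketched as a small helper that composes the `ray start` command from the role and the IP already passed to run_cluster.sh. This is a simplified sketch, not the script's actual argument handling, and `build_ray_start` is a hypothetical name:

```shell
# build_ray_start: compose the ray start command for a given role and IP.
# Simplified sketch of the approach this PR takes in run_cluster.sh;
# the real script's argument handling differs.
build_ray_start() {
  local role="$1" node_ip="$2"
  if [ "$role" = "--head" ]; then
    # Pin Ray's node IP to the chosen interface instead of letting Ray
    # auto-detect it (auto-detection picks the default-route interface,
    # i.e. the ethernet/public one in the multi-NIC scenario above).
    echo "ray start --head --port=6379 --node-ip-address=$node_ip --block"
  else
    # Workers receive the head node's IP and join over that address.
    echo "ray start --address=$node_ip:6379 --block"
  fi
}
```

`--node-ip-address` and `--address` are standard `ray start` flags; the point of the change is simply that the IP the caller selected is forwarded to Ray rather than recomputed.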

Test Plan

In our scenario we run two devices with two interfaces each: one Ethernet cable providing internet access and one QSFP cable connecting the two servers directly (point to point). We want to confirm that the arguments passed to the run_cluster.sh script are respected, and that Ray itself selects the P2P QSFP interfaces instead of its default behavior of selecting the Ethernet interface (public IP address) on both nodes.

Test Result

From node 1:

$ export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
$ export MN_IF_NAME=enp1s0f1np1
$ export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
$ echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
Using interface enp1s0f1np1 with IP 192.168.200.12

$ bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
    -e VLLM_HOST_IP=$VLLM_HOST_IP \
    -e UCX_NET_DEVICES=$MN_IF_NAME \
    -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
    -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
    -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
    -e TP_SOCKET_IFNAME=$MN_IF_NAME \
    -e RAY_memory_monitor_refresh_ms=0 \
    -e MASTER_ADDR=$VLLM_HOST_IP
2025-12-05 20:17:59,542	INFO usage_lib.py:473 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-12-05 20:17:59,542	INFO scripts.py:914 -- Local node IP: 192.168.200.12
2025-12-05 20:18:00,748	SUCC scripts.py:950 -- --------------------
2025-12-05 20:18:00,748	SUCC scripts.py:951 -- Ray runtime started.
2025-12-05 20:18:00,748	SUCC scripts.py:952 -- --------------------
2025-12-05 20:18:00,748	INFO scripts.py:954 -- Next steps
2025-12-05 20:18:00,748	INFO scripts.py:957 -- To add another node to this Ray cluster, run
2025-12-05 20:18:00,748	INFO scripts.py:960 --   ray start --address='192.168.200.12:6379'
2025-12-05 20:18:00,748	INFO scripts.py:969 -- To connect to this Ray cluster:
2025-12-05 20:18:00,748	INFO scripts.py:971 -- import ray
2025-12-05 20:18:00,748	INFO scripts.py:972 -- ray.init(_node_ip_address='192.168.200.12')
2025-12-05 20:18:00,748	INFO scripts.py:984 -- To submit a Ray job using the Ray Jobs CLI:
2025-12-05 20:18:00,749	INFO scripts.py:985 --   RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265/' ray job submit --working-dir . -- python my_script.py
2025-12-05 20:18:00,749	INFO scripts.py:994 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
2025-12-05 20:18:00,749	INFO scripts.py:998 -- for more information on submitting Ray jobs to the Ray cluster.
2025-12-05 20:18:00,749	INFO scripts.py:1003 -- To terminate the Ray runtime, run
2025-12-05 20:18:00,749	INFO scripts.py:1004 --   ray stop
2025-12-05 20:18:00,749	INFO scripts.py:1007 -- To view the status of the cluster, use
2025-12-05 20:18:00,749	INFO scripts.py:1008 --   ray status
2025-12-05 20:18:00,749	INFO scripts.py:1012 -- To monitor and debug Ray, view the dashboard at 
2025-12-05 20:18:00,749	INFO scripts.py:1013 --   127.0.0.1:8265
2025-12-05 20:18:00,749	INFO scripts.py:1020 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2025-12-05 20:18:00,749	INFO scripts.py:1121 -- --block
2025-12-05 20:18:00,749	INFO scripts.py:1122 -- This command will now block forever until terminated by a signal.
2025-12-05 20:18:00,749	INFO scripts.py:1125 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
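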
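As an aside, the `grep -oP` one-liner used above requires a GNU grep built with PCRE support. An awk-only alternative that parses `ip -4 -o addr show` output works on more systems (`first_ipv4` is a hypothetical helper name, not part of the PR):

```shell
# first_ipv4: read `ip -4 -o addr show dev <iface>` output on stdin and
# print the interface's first IPv4 address. In -o (one-line) output,
# field 4 is "address/prefix", e.g. "192.168.200.12/24".
first_ipv4() {
  awk '$3 == "inet" {split($4, a, "/"); print a[1]; exit}'
}

# Usage, mirroring the session above:
#   export VLLM_HOST_IP=$(ip -4 -o addr show dev "$MN_IF_NAME" | first_ipv4)
```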

From node 2:

$ export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
$ export MN_IF_NAME=enp1s0f1np1
$ export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
$ echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
Using interface enp1s0f1np1 with IP 192.168.200.13
$ export HEAD_NODE_IP=192.168.200.12

$ bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
    -e VLLM_HOST_IP=$VLLM_HOST_IP \
    -e UCX_NET_DEVICES=$MN_IF_NAME \
    -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
    -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
    -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
    -e TP_SOCKET_IFNAME=$MN_IF_NAME \
    -e RAY_memory_monitor_refresh_ms=0 \
    -e MASTER_ADDR=$HEAD_NODE_IP
2025-12-05 20:19:22,139	INFO scripts.py:1095 -- Local node IP: 192.168.200.13
2025-12-05 20:19:22,350	SUCC scripts.py:1108 -- --------------------
2025-12-05 20:19:22,350	SUCC scripts.py:1109 -- Ray runtime started.
2025-12-05 20:19:22,350	SUCC scripts.py:1110 -- --------------------
2025-12-05 20:19:22,350	INFO scripts.py:1112 -- To terminate the Ray runtime, run
2025-12-05 20:19:22,350	INFO scripts.py:1113 --   ray stop
2025-12-05 20:19:22,351	INFO scripts.py:1121 -- --block
2025-12-05 20:19:22,351	INFO scripts.py:1122 -- This command will now block forever until terminated by a signal.
2025-12-05 20:19:22,351	INFO scripts.py:1125 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

Ray status:

$ docker exec $VLLM_CONTAINER ray status
======== Autoscaler status: 2025-12-05 20:19:35.263362 ========
Node status
---------------------------------------------------------------
Active:
 1 node_e4e6c7bbd25f4ce2fa1ed0cf2e71680b425d7b06922120a28cc5d371
 1 node_b3d73a90f8f40ad9f1aeefcbe7653e04cfe80c7f22b511f78349edf9
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/40.0 CPU
 0.0/2.0 GPU
 0B/218.88GiB memory
 0B/19.46GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 (no resource demands)
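Beyond `ray status`, it is worth confirming that both nodes registered with their QSFP addresses rather than the ethernet ones. A small filter over the node IPs could look like the following (hypothetical helper; the 192.168.200.0/24 subnet is taken from the logs above):

```shell
# check_subnet: succeed only if every IP read from stdin (one per line)
# starts with the given prefix, e.g. the 192.168.200 P2P subnet.
check_subnet() {
  local prefix="$1" ip bad=0
  while read -r ip; do
    case "$ip" in
      "$prefix".*) ;;                                   # in the P2P subnet
      *) echo "unexpected node IP: $ip" >&2; bad=1 ;;   # ethernet leak
    esac
  done
  return "$bad"
}

# Usage against the running cluster (assumes python and ray are available
# inside the container; ray.nodes() reports each node's address):
#   docker exec "$VLLM_CONTAINER" python -c 'import ray; ray.init(address="auto");
#   print("\n".join(n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]))' \
#     | check_subnet 192.168.200
```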


@mergify bot commented Dec 3, 2025

Documentation preview: https://vllm--30002.org.readthedocs.build/en/30002/

The mergify bot added the documentation label (Improvements or additions to documentation) Dec 3, 2025.
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to simplify how the node IP address is passed to Ray for multi-NIC setups. The change for the head node is correct, but the implementation for worker nodes introduces a bug where the VLLM_HOST_IP variable is used without being defined, breaking the intended functionality. I've provided a critical review comment with a suggested fix to correctly parse the worker IP from the arguments.

@evberrypi evberrypi changed the title Patch run-cluster.sh (fix for #28328) [FIX]Patch run-cluster.sh (fix for #28328) Dec 4, 2025
@ywang96 (Member) commented Dec 5, 2025

cc @mgoin
