Conversation

@evberrypi (Contributor) commented Dec 3, 2025

Purpose

This PR corrects the change from #28328 by removing the new block that built vLLM variables and instead adding the node IP address to the ray start command. This fixes issues when a device has multiple interfaces.
Use case: a device has both Ethernet and QSFP cabling, and we want Ray to serve over the high-speed link.
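The approach can be sketched as a small helper that composes the `ray start` command from the role and the IP already passed to run_cluster.sh. This is a simplified sketch, not the script's actual argument handling, and `build_ray_start` is a hypothetical name:

```shell
# build_ray_start: compose the ray start command for a given role and IP.
# Simplified sketch of the approach this PR takes in run_cluster.sh;
# the real script's argument handling differs.
build_ray_start() {
  local role="$1" node_ip="$2"
  if [ "$role" = "--head" ]; then
    # Pin Ray's node IP to the chosen interface instead of letting Ray
    # auto-detect it (auto-detection picks the default-route interface,
    # i.e. the ethernet/public one in the multi-NIC scenario above).
    echo "ray start --head --port=6379 --node-ip-address=$node_ip --block"
  else
    # Workers receive the head node's IP and join over that address.
    echo "ray start --address=$node_ip:6379 --block"
  fi
}
```

`--node-ip-address` and `--address` are standard `ray start` flags; the point of the change is simply that the IP the caller selected is forwarded to Ray rather than recomputed.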

Test Plan

In our scenario we run two devices with two interfaces each: one Ethernet cable providing internet access and one QSFP cable connecting the two servers directly (point to point). We want to confirm that the arguments passed to the run_cluster.sh script are respected, and that Ray itself selects the P2P QSFP interfaces instead of its default behavior of selecting the Ethernet interface (public IP address) on both nodes.

Test Result

From node 1:

$ export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
$ export MN_IF_NAME=enp1s0f1np1
$ export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
$ echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
Using interface enp1s0f1np1 with IP 192.168.200.12

$ bash run_cluster.sh $VLLM_IMAGE $VLLM_HOST_IP --head ~/.cache/huggingface \
    -e VLLM_HOST_IP=$VLLM_HOST_IP \
    -e UCX_NET_DEVICES=$MN_IF_NAME \
    -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
    -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
    -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
    -e TP_SOCKET_IFNAME=$MN_IF_NAME \
    -e RAY_memory_monitor_refresh_ms=0 \
    -e MASTER_ADDR=$VLLM_HOST_IP
2025-12-05 20:17:59,542	INFO usage_lib.py:473 -- Usage stats collection is enabled by default without user confirmation because this terminal is detected to be non-interactive. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.
2025-12-05 20:17:59,542	INFO scripts.py:914 -- Local node IP: 192.168.200.12
2025-12-05 20:18:00,748	SUCC scripts.py:950 -- --------------------
2025-12-05 20:18:00,748	SUCC scripts.py:951 -- Ray runtime started.
2025-12-05 20:18:00,748	SUCC scripts.py:952 -- --------------------
2025-12-05 20:18:00,748	INFO scripts.py:954 -- Next steps
2025-12-05 20:18:00,748	INFO scripts.py:957 -- To add another node to this Ray cluster, run
2025-12-05 20:18:00,748	INFO scripts.py:960 --   ray start --address='192.168.200.12:6379'
2025-12-05 20:18:00,748	INFO scripts.py:969 -- To connect to this Ray cluster:
2025-12-05 20:18:00,748	INFO scripts.py:971 -- import ray
2025-12-05 20:18:00,748	INFO scripts.py:972 -- ray.init(_node_ip_address='192.168.200.12')
2025-12-05 20:18:00,748	INFO scripts.py:984 -- To submit a Ray job using the Ray Jobs CLI:
2025-12-05 20:18:00,749	INFO scripts.py:985 --   RAY_API_SERVER_ADDRESS='http://127.0.0.1:8265/' ray job submit --working-dir . -- python my_script.py
2025-12-05 20:18:00,749	INFO scripts.py:994 -- See https://docs.ray.io/en/latest/cluster/running-applications/job-submission/index.html 
2025-12-05 20:18:00,749	INFO scripts.py:998 -- for more information on submitting Ray jobs to the Ray cluster.
2025-12-05 20:18:00,749	INFO scripts.py:1003 -- To terminate the Ray runtime, run
2025-12-05 20:18:00,749	INFO scripts.py:1004 --   ray stop
2025-12-05 20:18:00,749	INFO scripts.py:1007 -- To view the status of the cluster, use
2025-12-05 20:18:00,749	INFO scripts.py:1008 --   ray status
2025-12-05 20:18:00,749	INFO scripts.py:1012 -- To monitor and debug Ray, view the dashboard at 
2025-12-05 20:18:00,749	INFO scripts.py:1013 --   127.0.0.1:8265
2025-12-05 20:18:00,749	INFO scripts.py:1020 -- If connection to the dashboard fails, check your firewall settings and network configuration.
2025-12-05 20:18:00,749	INFO scripts.py:1121 -- --block
2025-12-05 20:18:00,749	INFO scripts.py:1122 -- This command will now block forever until terminated by a signal.
2025-12-05 20:18:00,749	INFO scripts.py:1125 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
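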
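As an aside, the `grep -oP` one-liner used above requires a GNU grep built with PCRE support. An awk-only alternative that parses `ip -4 -o addr show` output works on more systems (`first_ipv4` is a hypothetical helper name, not part of the PR):

```shell
# first_ipv4: read `ip -4 -o addr show dev <iface>` output on stdin and
# print the interface's first IPv4 address. In -o (one-line) output,
# field 4 is "address/prefix", e.g. "192.168.200.12/24".
first_ipv4() {
  awk '$3 == "inet" {split($4, a, "/"); print a[1]; exit}'
}

# Usage, mirroring the session above:
#   export VLLM_HOST_IP=$(ip -4 -o addr show dev "$MN_IF_NAME" | first_ipv4)
```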

From node 2:

$ export VLLM_IMAGE=nvcr.io/nvidia/vllm:25.11-py3
$ export MN_IF_NAME=enp1s0f1np1
$ export VLLM_HOST_IP=$(ip -4 addr show $MN_IF_NAME | grep -oP '(?<=inet\s)\d+(\.\d+){3}')
$ echo "Using interface $MN_IF_NAME with IP $VLLM_HOST_IP"
Using interface enp1s0f1np1 with IP 192.168.200.13
$ export HEAD_NODE_IP=192.168.200.12

$ bash run_cluster.sh $VLLM_IMAGE $HEAD_NODE_IP --worker ~/.cache/huggingface \
    -e VLLM_HOST_IP=$VLLM_HOST_IP \
    -e UCX_NET_DEVICES=$MN_IF_NAME \
    -e NCCL_SOCKET_IFNAME=$MN_IF_NAME \
    -e OMPI_MCA_btl_tcp_if_include=$MN_IF_NAME \
    -e GLOO_SOCKET_IFNAME=$MN_IF_NAME \
    -e TP_SOCKET_IFNAME=$MN_IF_NAME \
    -e RAY_memory_monitor_refresh_ms=0 \
    -e MASTER_ADDR=$HEAD_NODE_IP
2025-12-05 20:19:22,139	INFO scripts.py:1095 -- Local node IP: 192.168.200.13
2025-12-05 20:19:22,350	SUCC scripts.py:1108 -- --------------------
2025-12-05 20:19:22,350	SUCC scripts.py:1109 -- Ray runtime started.
2025-12-05 20:19:22,350	SUCC scripts.py:1110 -- --------------------
2025-12-05 20:19:22,350	INFO scripts.py:1112 -- To terminate the Ray runtime, run
2025-12-05 20:19:22,350	INFO scripts.py:1113 --   ray stop
2025-12-05 20:19:22,351	INFO scripts.py:1121 -- --block
2025-12-05 20:19:22,351	INFO scripts.py:1122 -- This command will now block forever until terminated by a signal.
2025-12-05 20:19:22,351	INFO scripts.py:1125 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.

Ray status:

$ docker exec $VLLM_CONTAINER ray status
======== Autoscaler status: 2025-12-05 20:19:35.263362 ========
Node status
---------------------------------------------------------------
Active:
 1 node_e4e6c7bbd25f4ce2fa1ed0cf2e71680b425d7b06922120a28cc5d371
 1 node_b3d73a90f8f40ad9f1aeefcbe7653e04cfe80c7f22b511f78349edf9
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 0.0/40.0 CPU
 0.0/2.0 GPU
 0B/218.88GiB memory
 0B/19.46GiB object_store_memory

From request_resources:
 (none)
Pending Demands:
 (no resource demands)
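Beyond `ray status`, it is worth confirming that both nodes registered with their QSFP addresses rather than the ethernet ones. A small filter over the node IPs could look like the following (hypothetical helper; the 192.168.200.0/24 subnet is taken from the logs above):

```shell
# check_subnet: succeed only if every IP read from stdin (one per line)
# starts with the given prefix, e.g. the 192.168.200 P2P subnet.
check_subnet() {
  local prefix="$1" ip bad=0
  while read -r ip; do
    case "$ip" in
      "$prefix".*) ;;                                   # in the P2P subnet
      *) echo "unexpected node IP: $ip" >&2; bad=1 ;;   # ethernet leak
    esac
  done
  return "$bad"
}

# Usage against the running cluster (assumes python and ray are available
# inside the container; ray.nodes() reports each node's address):
#   docker exec "$VLLM_CONTAINER" python -c 'import ray; ray.init(address="auto");
#   print("\n".join(n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]))' \
#     | check_subnet 192.168.200
```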


@mergify bot commented Dec 3, 2025

Documentation preview: https://vllm--30002.org.readthedocs.build/en/30002/

The mergify bot added the documentation label (Improvements or additions to documentation) Dec 3, 2025.
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to simplify how the node IP address is passed to Ray for multi-NIC setups. The change for the head node is correct, but the implementation for worker nodes introduces a bug where the VLLM_HOST_IP variable is used without being defined, breaking the intended functionality. I've provided a critical review comment with a suggested fix to correctly parse the worker IP from the arguments.

@evberrypi evberrypi changed the title Patch run-cluster.sh (fix for #28328) [FIX]Patch run-cluster.sh (fix for #28328) Dec 4, 2025
@ywang96 (Member) commented Dec 5, 2025

cc @mgoin
