[Bugfix] Fix ray instance detect issue #9439

yma11 · 2024-10-17T00:49:45Z

Fix Ray instance detection so that it first tries to connect to the most recently launched instance and, if none is found, creates a new one with num_gpus=parallel_config.world_size.

github-actions · 2024-10-17T00:49:58Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

vllm/executor/ray_utils.py

yma11 · 2024-10-21T02:28:11Z

@youkaichao can you help review this change? Thanks.

comaniac

Overall LGTM. One question: should we also change the other branch?

comaniac · 2024-10-22T06:08:03Z

vllm/executor/ray_utils.py

-                 num_gpus=parallel_config.world_size)
+        # Try to connect existing ray instance and create a new one if not found
+        try:
+            ray.init('auto')


Use double quotes for consistency.

yma11 · 2024-10-22T07:09:23Z

Overall LGTM. One question: should we also change the other branch?

Agreed. I unified the init logic, which should make sense for all platforms. Please take another look. Thanks.

comaniac · 2024-10-22T07:25:37Z

The updated code doesn't seem equivalent to the original one. Currently, for non-HIP and non-XPU cases, we don't init Ray with all GPUs.

yma11 · 2024-10-22T09:40:14Z

The updated code doesn't seem equivalent to the original one. Currently, for non-HIP and non-XPU cases, we don't init Ray with all GPUs.

For non-HIP and non-XPU cases, according to the explanation, it will ultimately create a local instance with the detected GPUs if it fails to connect to an existing cluster.
Actually, I intended to fix the error "When connecting to an existing cluster, num_cpus and num_gpus must not be provided." in the XPU case. It happens when a valid ray_address and num_gpus are both given. I wanted to respect both values, but it seems the conflict can't be resolved. Maybe it's more reasonable to call ray.init(address=ray_address, ignore_reinit_error=True) for all platforms. num_gpus=parallel_config.world_size is expected to take effect only when a new local instance is created, but it's not so meaningful in that case. What do you think?

comaniac · 2024-10-22T15:29:55Z

Sounds reasonable to me, but cc @rkooo567 @richardliaw to double check.

youkaichao · 2024-10-24T03:10:43Z

@yma11 please resolve the conflict

yma11 · 2024-10-24T03:47:08Z

@youkaichao Thanks for the reminder. @comaniac I switched the fix back to only change the HIP and XPU code path, since there is a possible issue on these platforms. When no Ray cluster exists and a new instance is launched, Ray may fail to detect the correct number of GPUs, which leaves no GPU resources available for Ray worker allocation. So we need to pass num_gpus as an argument in this case. That's why this specific code path exists here. FYI, and thanks for your review.
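
A minimal sketch of the connect-first, fall-back-to-local flow described above. It is not the code merged in this PR: `init_ray_sketch`, `needs_explicit_gpus`, and the returned labels are hypothetical names used for illustration, and the Ray init function is passed in as `ray_init` so the logic can be followed without a cluster. It assumes that connecting with "auto" raises ConnectionError when no running instance is found, and that the fallback starts a local instance with an explicit GPU count.

```python
from typing import Any, Callable, Optional


def init_ray_sketch(
    ray_init: Callable[..., Any],
    ray_address: Optional[str],
    world_size: int,
    needs_explicit_gpus: bool,
) -> str:
    """Hypothetical helper; returns a label for the path that was taken."""
    if not needs_explicit_gpus:
        # Other platforms: let Ray resolve the address and detect resources.
        ray_init(address=ray_address, ignore_reinit_error=True)
        return "default"
    try:
        # HIP/XPU: try an existing instance first, with no resource arguments,
        # since num_gpus must not be given when joining an existing cluster.
        ray_init("auto")
        return "existing"
    except ConnectionError:
        # Assumed failure mode when nothing is running: start a new local
        # instance and state the GPU count explicitly.
        ray_init(ignore_reinit_error=True, num_gpus=world_size)
        return "new-local"
```

With Ray installed, `ray.init` could be passed as `ray_init`, and `world_size` would correspond to `parallel_config.world_size`.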

youkaichao · 2024-10-27T23:33:26Z

@DarkLight1337 could you please check whether the error is related to this PR, or whether it already occurred on the main branch?

DarkLight1337 · 2024-10-28T03:03:12Z

It is a failure from the main branch that has since been fixed. You can force-merge this.

russellb requested changes Oct 17, 2024

vllm/executor/ray_utils.py (outdated)

yma11 force-pushed the ray-fix branch from 8e65f36 to 2147ca5 Compare October 18, 2024 07:21

yma11 force-pushed the ray-fix branch from 2147ca5 to 13f04c4 Compare October 22, 2024 06:03

comaniac reviewed Oct 22, 2024

youkaichao assigned rkooo567 Oct 24, 2024

rkooo567 approved these changes Oct 24, 2024

yma11 force-pushed the ray-fix branch from 74b9123 to a57b4bc Compare October 24, 2024 03:39

yma11 force-pushed the ray-fix branch 2 times, most recently from bc652ab to af18da6 Compare October 24, 2024 11:55

Fix ray instance detect issue

af18da6

Signed-off-by: yan ma <yan.ma@intel.com>

comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) Oct 24, 2024

comaniac enabled auto-merge (squash) October 24, 2024 15:27

Merge branch 'main' into ray-fix

55e6b39

comaniac merged commit 2adb440 into vllm-project:main Oct 28, 2024
58 checks passed

HollowMan6 mentioned this pull request Oct 28, 2024

[Bugfix] No num_gpus for ROCm and XPU when connecting to a ray cluster #8781

Closed

FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

499cad6

Signed-off-by: qishuai <ferdinandzhong@gmail.com>

rasmith pushed a commit to rasmith/vllm that referenced this pull request Oct 30, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

a26f8ea

Signed-off-by: Randall Smith <Randall.Smith@amd.com>

sumitd2 pushed a commit to sumitd2/vllm that referenced this pull request Nov 14, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

9ad4845

Signed-off-by: Sumit Dubey <sumit.dubey2@ibm.com>

sleepwalker2017 pushed a commit to sleepwalker2017/vllm that referenced this pull request Dec 13, 2024

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

c211880

LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025

[Bugfix] Fix ray instance detect issue (vllm-project#9439)

a476a3b

Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>

yma11 deleted the ray-fix branch May 27, 2025 02:07
