[Bugfix] Fix ray instance detect issue #9439

Merged 2 commits on Oct 28, 2024

yma11 (Contributor) commented Oct 17, 2024

Fix Ray instance detection so that vLLM first tries to connect to the most recently launched instance and, if none is found, creates a new one with num_gpus=parallel_config.world_size.
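The connect-then-create behavior described above could be sketched roughly as follows. This is an illustration, not the code from the PR: `connect_or_start` is a hypothetical helper, the `ray_init` parameter stands in for `ray.init` so the sketch runs without Ray installed, and it assumes a missing instance surfaces as a `ConnectionError`.

```python
from typing import Any, Callable


def connect_or_start(ray_init: Callable[..., Any], world_size: int) -> str:
    """Attach to a running Ray instance, or start a local one on failure.

    `ray_init` is expected to behave like `ray.init`; it is injected here
    (hypothetical design) so the flow can be exercised without Ray.
    """
    try:
        # "auto" asks Ray to attach to an already-running instance.
        ray_init("auto")
        return "connected"
    except ConnectionError:
        # Assumption: no running instance is reported as ConnectionError.
        # Start a new local instance and state the GPU count explicitly.
        ray_init(num_gpus=world_size)
        return "started"
```

With Ray installed, the call would presumably look like `connect_or_start(ray.init, parallel_config.world_size)`.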

vllm/executor/ray_utils.py (outdated, resolved)
yma11 (Contributor Author) commented Oct 21, 2024

@youkaichao can you help review this change? Thanks.

comaniac (Collaborator) left a comment

Overall LGTM. One question: should we also change the other branch?

             num_gpus=parallel_config.world_size)
    # Try to connect existing ray instance and create a new one if not found
    try:
        ray.init('auto')
Collaborator


Use double quotes for consistency.

yma11 (Contributor Author) commented Oct 22, 2024

> Overall LGTM. One question: should we also change the other branch?

Agreed. I unified the init logic, which should make sense for all platforms. Please take another look. Thanks.

comaniac (Collaborator) commented

The updated code doesn't seem equivalent to the original. Currently, for non-HIP and non-XPU cases, we don't init Ray with all GPUs.

yma11 (Contributor Author) commented Oct 22, 2024

> The updated code doesn't seem equivalent to the original. Currently, for non-HIP and non-XPU cases, we don't init Ray with all GPUs.

For non-HIP and non-XPU cases, it will eventually create a local instance with the detected GPUs if it fails to connect to an existing cluster, based on the explanation.
Actually, I intended to fix the error "When connecting to an existing cluster, num_cpus and num_gpus must not be provided." in the XPU case. It occurs when a valid ray_address and num_gpus are both given. I wanted to respect both values, but it seems the conflict can't be resolved. Maybe it's more reasonable to call ray.init(address=ray_address, ignore_reinit_error=True) on all platforms. num_gpus=parallel_config.world_size is expected to take effect only when a new local instance is created, but it's not very meaningful in that case. What do you think?

comaniac (Collaborator) commented

Sounds reasonable to me, but cc @rkooo567 @richardliaw to double check.

youkaichao (Member) commented

@yma11 please resolve the conflict

yma11 (Contributor Author) commented Oct 24, 2024

@youkaichao Thanks for the reminder. @comaniac I switched the fix back to changing only the HIP and XPU code path, since there is a possible issue on these platforms. When no Ray cluster exists and a new instance is launched, Ray may fail to detect the correct number of GPUs, which leaves no GPU resources available for Ray worker allocation. So we need to pass num_gpus as an argument in this case, which is why this specific code path exists. FYI, and thanks for your review.
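The platform-specific split described in this comment might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: `init_ray_for_platform` and the `needs_explicit_gpus` flag are hypothetical names, `ray_init` stands in for `ray.init`, and a missing instance is assumed to raise `ConnectionError`.

```python
from typing import Any, Callable, Optional


def init_ray_for_platform(
    ray_init: Callable[..., Any],
    ray_address: Optional[str],
    world_size: int,
    needs_explicit_gpus: bool,
) -> None:
    """Initialize Ray, passing num_gpus only where detection is unreliable.

    `needs_explicit_gpus` is a hypothetical flag standing in for a
    "running on HIP or XPU" platform check.
    """
    if needs_explicit_gpus:
        try:
            # Prefer an instance that is already running.
            ray_init("auto", ignore_reinit_error=True)
        except ConnectionError:
            # Assumption: no instance was found. Start a local one and
            # state the GPU count, since auto-detection may report none.
            ray_init(ignore_reinit_error=True, num_gpus=world_size)
    else:
        # Other platforms keep the plain call and let Ray detect GPUs.
        ray_init(address=ray_address, ignore_reinit_error=True)
```

Because num_gpus is passed only when a new local instance is started, the sketch never combines it with a connection to an existing cluster, which is the combination that triggered the error quoted earlier in the thread.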

yma11 force-pushed the ray-fix branch 2 times, most recently from bc652ab to af18da6, on October 24, 2024 11:55
Signed-off-by: yan ma <yan.ma@intel.com>
comaniac added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) Oct 24, 2024
comaniac enabled auto-merge (squash) October 24, 2024 15:27
youkaichao (Member) commented

@DarkLight1337 could you check whether the error is related to this PR, or whether it already occurred on the main branch?

DarkLight1337 (Member) commented

It is a failure from the main branch that has since been fixed. You can force merge this.

comaniac merged commit 2adb440 into vllm-project:main Oct 28, 2024
58 checks passed
cooleel pushed a commit to cooleel/vllm that referenced this pull request Oct 28, 2024
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai>
cooleel pushed a commit to cooleel/vllm that referenced this pull request Oct 28, 2024
Signed-off-by: Shanshan Wang <shanshan.wang@h2o.ai>
FerdinandZhong pushed a commit to FerdinandZhong/vllm that referenced this pull request Oct 29, 2024
Signed-off-by: qishuai <ferdinandzhong@gmail.com>
rasmith pushed a commit to rasmith/vllm that referenced this pull request Oct 30, 2024
Signed-off-by: Randall Smith <Randall.Smith@amd.com>
NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Oct 31, 2024
Signed-off-by: NickLucche <nlucches@redhat.com>
NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Oct 31, 2024
Signed-off-by: NickLucche <nlucches@redhat.com>
6 participants