-
-
Notifications
You must be signed in to change notification settings - Fork 11.2k
Fix v1 supported oracle for worker-cls and worker-extension-cls #15324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 |
|
full error log: @comaniac it seems passing the placement group is not enough. the new process fails to recognize the old ray cluster. 2d1e722 should fix it. |
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
| if self.worker_cls != EngineArgs.worker_cls: | ||
| _raise_or_fallback(feature_name="--worker-cls", | ||
| recommend_to_remove=False) | ||
| return False | ||
|
|
||
| if self.worker_extension_cls != EngineArgs.worker_extension_cls: | ||
| _raise_or_fallback(feature_name="--worker-extension-cls", | ||
| recommend_to_remove=False) | ||
| return False | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @robertgshaw2-redhat for visibility.
| - pushd ../examples/offline_inference | ||
| - python3 rlhf.py | ||
| - RAY_DEDUP_LOGS=0 python3 rlhf_colocate.py | ||
| - VLLM_ENABLE_V1_MULTIPROCESSING=0 python3 rlhf.py |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in v1, LLM.collective_rpc is broken in the default case, because self.llm_engine.model_executor is in a different process.
cc @njhill @robertgshaw2-redhat if you can help fix it. i'm disabling VLLM_ENABLE_V1_MULTIPROCESSING for the test right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should fall back to this automatically if the user selects --worker-cls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we don't need to fall back. worker-cls and rlhf can be separate things.
vllm/utils.py
Outdated
| # even we choose to spawn, we need to pass to the subprocess | ||
| # the ray address, so that it knows to connect to the ray cluster. | ||
| # env vars are inherited by subprocesses, even if we use spawn. | ||
| import ray | ||
| os.environ["RAY_ADDRESS"] = ray.get_runtime_context().gcs_address |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a follow-up after #14705
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah make sense. I only tested the Ray Serve use case before. cc @ruisearch42
ruisearch42
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
| # even we choose to spawn, we need to pass to the subprocess | ||
| # the ray address, so that it knows to connect to the ray cluster. | ||
| # env vars are inherited by subprocesses, even if we use spawn. | ||
| import ray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ray is not always installed for vllm, we need a try-except like is_in_ray_actor()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is guarded under is_in_ray_actor(), so it is safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's right, realized it afterwards
Signed-off-by: youkaichao <youkaichao@gmail.com>
…-project#15324) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>
…-project#15324) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Wes Medford <wryanmedford@gmail.com>
…-project#15324) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
…-project#15324) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>
…-project#15324) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com>
…-project#15324) Signed-off-by: youkaichao <youkaichao@gmail.com> Co-authored-by: youkaichao <youkaichao@gmail.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
No description provided.