[Bug] [Ray Autoscaler] [Core] Ray Worker Node Relaunching during 'ray up' #20402
Comments
By "relaunches the worker," do you mean restarts Ray on the worker? `ray up` restarts Ray across the cluster by default. Let me know if that makes sense / solves the issue.
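For reference, a minimal sketch of the two behaviors, driving the CLI from Python (the config path `cluster.yaml` is hypothetical):

```python
# Sketch: default vs. --no-restart behavior of `ray up`.
# "cluster.yaml" is a hypothetical config path.
import subprocess

# Default: `ray up` restarts Ray services on every node of the cluster.
subprocess.run(["ray", "up", "-y", "cluster.yaml"], check=True)

# With --no-restart, already-running Ray services are left untouched.
subprocess.run(["ray", "up", "-y", "cluster.yaml", "--no-restart"], check=True)
```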
Got it.
Had a typo in the path: those look like driver logs (as opposed to autoscaler logs). Those logs are helpful, though. Logs for the thread that is supposed to restart Ray on the worker, I think, are
OK, seeing the weirdness with the default example configs.
Ray start output when attempting to restart the worker's Ray on the second `ray up`:
@kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.
Thanks for the investigation @DmitriGekhtman. FYI this showed up even if Docker is not used - e.g.,
I'm not sure about this. It seems that the registered IP address of the Raylet doesn't match the one detected by the driver, so the driver cannot find the local Raylet instance to connect to. @ConeyLiu Any thoughts?
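A rough way to check for that mismatch from the driver side (a diagnostic sketch, not a fix; it assumes it runs on a node of an already-started cluster and that `ray.init(address="auto")` still succeeds):

```python
# Diagnostic sketch: compare the IP the driver detects with the
# Raylet (NodeManager) addresses registered in the cluster.
import ray
from ray.util import get_node_ip_address

ray.init(address="auto")  # connect to the existing cluster

local_ip = get_node_ip_address()
registered = [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]]

print("driver-detected IP:", local_ip)
print("registered Raylet IPs:", registered)
if local_ip not in registered:
    print("mismatch: no live Raylet is registered under the driver's IP")
```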
This looks pretty bad -- I'm seeing this in other contexts where we try to restart Ray on a node.
Any update? We could work around this by delaying
Leaving this exclusively to @wuisawesome, since this issue appears to have a Ray-internal component, and that's a good enough reason to disqualify myself.
SGTM. We still encounter this issue pretty frequently, and it'd be great if it were resolved soon.
Possibly related to #19834?
Yeah, fairly confident #19834 (comment) is related. Basically, restarting Ray on workers makes the worker and head nodes sad. Is this because
A dumb workaround is to try and issue an extra … See surrounding code + files for the full repro.
Yeah, not sure if this is new info, but
Search before asking
Ray Component
Ray Clusters
What happened + What you expected to happen
The Ray Autoscaler will relaunch the worker even if the head and worker nodes are both healthy and their file systems are identical.
This can be replicated by running `ray up` on most Autoscaler configuration files over and over again. @concretevitamin @ericl
Versions / Dependencies
Most recent version of Ray and Ray Autoscaler.
Reproduction script
Autoscaler config provided below. Run `ray up -y config/aws-distributed.yml --no-config-cache` once and wait (important!) until the worker is fully set up via `ray status`. Rinse and repeat on the same configuration file. Eventually, on one of the runs, the Autoscaler will relaunch the worker node. A rough repro loop is sketched below.
Anything else
No response
Are you willing to submit a PR?