
[Bug] [Ray Autoscaler] [Core] Ray Worker Node Relaunching during 'ray up' #20402

Open · 1 of 2 tasks
michaelzhiluo opened this issue Nov 16, 2021 · 18 comments
Assignees: wuisawesome
Labels: bug (Something that is supposed to be working; but isn't), infra (autoscaler, ray client, kuberay, related issues), P2 (Important issue, but not time-critical)
Milestone: Serverless Autoscaling

@michaelzhiluo (Contributor) commented Nov 16, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

The Ray Autoscaler will relaunch the worker node even when the head and worker nodes are both healthy and their file systems are identical. This can be reproduced by repeatedly running ray up with most Autoscaler configuration files.

@concretevitamin @ericl

Versions / Dependencies

Most recent version of Ray and Ray Autoscaler.

Reproduction script

The Autoscaler config is provided below. Run ray up -y config/aws-distributed.yml --no-config-cache once and wait (important!) until the worker is fully set up, as reported by ray status. Then rinse and repeat on the same configuration file. Eventually, on one of the runs, the Autoscaler will relaunch the worker node. A rough driver loop for this is sketched after the config.

auth:
  ssh_user: ubuntu
available_node_types:
  ray.head.default:
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
      ImageId: ami-04b343a85ab150b2d
      InstanceType: p3.2xlarge
    resources: {}
  ray.worker.default:
    max_workers: 1
    min_workers: 1
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
      ImageId: ami-04b343a85ab150b2d
      InstanceType: p3.2xlarge
    resources: {}
cluster_name: temp-aws
docker:
  container_name: ''
  image: ''
  pull_before_run: true
  run_options:
  - --ulimit nofile=65536:65536
  - -p 8008:8008
file_mounts: {}
head_node_type: ray.head.default
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 1
provider:
  cache_stopped_nodes: true
  region: us-east-2
  type: aws
rsync_exclude:
- '**/.git'
- '**/.git/**'
rsync_filter:
- .gitignore
setup_commands:
- pip3 install ray
- mkdir -p /tmp/workdir && cd /tmp/workdir && pip3 install --upgrade pip && pip3 install ray[default]
upscaling_speed: 1.0
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
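
For convenience, a rough driver loop for the repro steps above might look like this (sketch; the config path matches the command above, and the grep pattern for a ready worker is an assumption to adjust against your actual ray status output):

CONFIG=config/aws-distributed.yml   # path used in the repro command above

for i in 1 2 3 4 5; do
  ray up -y "$CONFIG" --no-config-cache
  # Poll ray status on the head node until the worker shows up.
  # The grep pattern is an assumption; adjust it to your ray status output.
  until ray exec "$CONFIG" "ray status" | grep -q "ray.worker.default"; do
    sleep 30
  done
done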

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
michaelzhiluo added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage: priority, bug/not-bug, and owning component) labels on Nov 16, 2021
@DmitriGekhtman (Contributor) commented:

By "relaunches the worker", do you mean it restarts Ray on the worker?

Ray up restarts Ray across the cluster by default.
To avoid the restart, add the flag --no-restart.
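
For example, something along these lines (sketch; the config path is the one from the repro):

ray up -y config/aws-distributed.yml --no-restart --no-config-cache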

Let me know if that makes sense / solves the issue.

@michaelzhiluo (Contributor, Author) commented:

Thanks for the quick reply! By relaunching the worker, I mean that the Autoscaler stops the EC2 worker node and restarts it. The screenshot below shows what we are trying to avoid: when running ray up again, we want to prevent the worker from being relaunched from scratch.
[Screenshot: Screen Shot 2021-11-15 at 6.45.38 PM]

@DmitriGekhtman (Contributor) commented Nov 16, 2021

Got it.
Yeah, that's a bug. Could you post autoscaler logs after running ray up the second time? (ray monitor cluster.yaml, or the contents of /tmp/ray/session_latest/logs/monitor.*) Those should have some lines explaining why the worker was taken down.

@michaelzhiluo (Contributor, Author) commented:

2021-11-16 00:59:35,192 WARNING worker.py:1227 -- The actor or task with ID fd3463a596384e93cc2f4c914d291ea41a67faa92ae5ef73 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000}, {node:172.31.28.203: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
(task-172.31.8.251 pid=24866) .
(task-172.31.8.251 pid=24866) ..
(autoscaler +20s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +20s) Restarting 1 nodes of type ray.worker.default (lost contact with raylet).
(raylet, ip=172.31.28.203) E1116 00:59:18.757578546   13713 server_chttp2.cc:49]        {"created":"@1637024358.757519244","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024358.757513873","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024358.757496636","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757490826","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024358.757512644","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757509731","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 00:59:18,803 C 13713 13713] grpc_server.cc:82:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203)     ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203)     ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203)     ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203)     ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203)     ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203)     main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203)     boost::asio::io_context::run()
(raylet, ip=172.31.28.203)     main
(raylet, ip=172.31.28.203)     __libc_start_main
(raylet, ip=172.31.28.203)
(raylet, ip=172.31.28.203) E1116 01:00:01.855151317   32369 server_chttp2.cc:49]        {"created":"@1637024401.855092228","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024401.855086343","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024401.855067914","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855061141","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024401.855085081","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855082214","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 01:00:01,896 C 32369 32369] grpc_server.cc:82:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203)     ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203)     ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203)     ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203)     ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203)     ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203)     main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203)     boost::asio::io_context::run()
(raylet, ip=172.31.28.203)     main
(raylet, ip=172.31.28.203)     __libc_start_main
(raylet, ip=172.31.28.203)
(autoscaler +1m4s) Removing 1 nodes of type ray.worker.default (launch failed).
(autoscaler +1m9s) Adding 1 nodes of type ray.worker.default.
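
As the raylet error above suggests, a quick check of what already holds port 8076 on the worker might look like this (sketch; the worker IP comes from the logs above, and the SSH key path is a placeholder):

# Inspect the worker from your workstation (key path is a placeholder).
ssh -i ~/.ssh/<your-key>.pem ubuntu@172.31.28.203 \
  'lsof -i :8076; pgrep -af raylet'   # what listens on 8076, and is an old raylet still alive?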

@DmitriGekhtman (Contributor) commented:

I had a typo in the path earlier: for the autoscaler logs, look at /tmp/ray/session_latest/logs/monitor.* on the head node.

Those look like driver logs (as opposed to autoscaler logs). They are helpful, though.

What we're seeing is that Ray on the worker is failing to get restarted, so the autoscaler freaks out and shuts the worker down before launching a new one to satisfy the min_workers constraint.

I think the logs for the thread that is supposed to restart Ray on the worker are in /tmp/ray/session_latest/logs/monitor.out.
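
Concretely, either of these should surface those autoscaler logs (sketch; the config path is the one used in the repro above):

ray monitor config/aws-distributed.yml   # tails the autoscaler logs from your workstation
# or, directly on the head node:
tail -n 200 /tmp/ray/session_latest/logs/monitor.out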

@DmitriGekhtman (Contributor) commented:

OK, I'm seeing the weirdness with the default example configs, too.

@DmitriGekhtman (Contributor) commented Nov 16, 2021

Ray start output when attempting to restart the worker's ray on the second ray up:

Local node IP: 10.0.1.18
[2021-11-15 23:05:38,600 I 224 224] global_state_accessor.cc:394: This node has an IP address of 10.0.1.18, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

@DmitriGekhtman (Contributor) commented Nov 16, 2021

@kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.

DmitriGekhtman added the P1 label (Issue that should be fixed within a few weeks) and removed the triage label on Nov 16, 2021
DmitriGekhtman added this to the Serverless Autoscaling milestone on Nov 16, 2021
@concretevitamin (Contributor) commented:

Thanks for the investigation @DmitriGekhtman. FYI, this showed up even when Docker is not used, e.g., with the docker: section removed from the YAML.

DmitriGekhtman changed the title from "[Bug] [Ray Autoscaler] Ray Worker Node Relaunching during 'ray up'" to "[Bug] [Ray Autoscaler] [Core] Ray Worker Node Relaunching during 'ray up'" on Nov 16, 2021
@kfstorm (Member) commented Nov 17, 2021

> @kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.

I'm not sure about this. It seems that the registered IP address of Raylet doesn't match the one detected by the driver. So the driver cannot find the local Raylet instance to connect to.

@ConeyLiu Any thoughts?

@DmitriGekhtman (Contributor) commented:

This looks pretty bad -- I'm seeing this in other contexts where we try to restart Ray on a node.

@concretevitamin (Contributor) commented:

Any update? We can work around this by delaying the second ray up as much as possible, but at some point it does need to be run again.

wuisawesome self-assigned this on Jan 4, 2022
@DmitriGekhtman (Contributor) commented:

Leaving this exclusively to @wuisawesome, since this issue appears to have a Ray-internal component, and that's a good enough reason to disqualify myself.

DmitriGekhtman removed their assignment on Feb 1, 2022
@michaelzhiluo (Contributor, Author) commented:

SGTM. We still encounter this issue pretty frequently, and it would be great if it were resolved soon.

@EricCousineau-TRI (Contributor) commented:

Possibly related to #19834?

@EricCousineau-TRI (Contributor) commented:

Yeah, fairly confident #19834 (comment) is related

Basically, yeah, restarting ray on workers makes the worker + head nodes sad.

Is this because the ray stop in worker_start_ray_commands may not always stop Ray correctly? Perhaps it leaves a lingering raylet?
https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-worker-start-ray-commands
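
One way to check that hypothesis (sketch; assumes you can SSH to the worker while the previous session is still up):

# Run these on the worker node:
ray stop            # the same command worker_start_ray_commands runs first
pgrep -af raylet    # if this still prints a raylet, ray stop did not fully stop it
lsof -i :8076       # shows whether the old object manager port is still bound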

@EricCousineau-TRI (Contributor) commented:

A dumb workaround is to issue an extra ray stop || true command before running ray up. It doesn't seem perfect, but it lowers the chance of running into this:
https://github.com/EricCousineau-TRI/repro/blob/b63b25f4683dd0afd7582748c2adfe7dc8aa0c6f/python/ray_example/run_all.sh#L20-L21

See the surrounding code and files for the full repro.
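
A rough version of that pre-stop workaround, without the helper scripts, might look like this (sketch; assumes ray get-worker-ips is available in your Ray version, that the nodes accept SSH as ubuntu, and that the key path is a placeholder):

CONFIG=config/aws-distributed.yml
# Stop Ray on the head node first (|| true so a dead Ray doesn't abort the script).
ray exec "$CONFIG" "ray stop || true"
# Then stop Ray on each worker before re-running ray up.
for ip in $(ray get-worker-ips "$CONFIG"); do
  ssh -i ~/.ssh/<your-key>.pem "ubuntu@$ip" "ray stop || true"
done
ray up -y "$CONFIG" --no-config-cache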

@EricCousineau-TRI (Contributor) commented:

Yeah, not sure if this is new info, but ray stop does not always seem to stop the server. I confirmed that in a setup I was just running, using my hacky ray_exec_all script. Output:
https://gist.github.com/EricCousineau-TRI/f2f67c488b75956bbb9d105cc4794ebc#file-ray-stop-failure-sh-L40-L58

Script: https://github.com/EricCousineau-TRI/repro/blob/b63b25f4683dd0afd7582748c2adfe7dc8aa0c6f/python/ray_example/ray_exec_all.py

AmeerHajAli added the infra label (autoscaler, ray client, kuberay, related issues) on Mar 26, 2022
hora-anyscale added the P2 label (Important issue, but not time-critical) and removed the P1 label (Issue that should be fixed within a few weeks) on Dec 19, 2022