
[Bug] [Ray Autoscaler] [Core] Ray Worker Node Relaunching during 'ray up' #20402

Open · 1 of 2 tasks
michaelzhiluo opened this issue Nov 16, 2021 · 18 comments
Assignees: wuisawesome
Labels: bug (Something that is supposed to be working; but isn't), infra (autoscaler, ray client, kuberay, related issues), P2 (Important issue, but not time-critical)
Milestone: Serverless Autoscaling

@michaelzhiluo (Contributor) commented Nov 16, 2021

Search before asking

  • I searched the issues and found no similar issues.

Ray Component

Ray Clusters

What happened + What you expected to happen

The Ray Autoscaler will relaunch the worker node even when the head and worker nodes are both healthy and their file systems are identical. This can be reproduced by repeatedly running ray up with most Autoscaler configuration files.

@concretevitamin @ericl

Versions / Dependencies

Most recent version of Ray and Ray Autoscaler.

Reproduction script

The Autoscaler config is provided below. Run ray up -y config/aws-distributed.yml --no-config-cache once and wait (important!) until the worker is fully set up, as reported by ray status. Then rinse and repeat on the same configuration file. Eventually, on one of the runs, the Autoscaler will relaunch the worker node. A rough driver loop for this is sketched after the config.

auth:
  ssh_user: ubuntu
available_node_types:
  ray.head.default:
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
      ImageId: ami-04b343a85ab150b2d
      InstanceType: p3.2xlarge
    resources: {}
  ray.worker.default:
    max_workers: 1
    min_workers: 1
    node_config:
      BlockDeviceMappings:
      - DeviceName: /dev/sda1
        Ebs:
          VolumeSize: 500
      ImageId: ami-04b343a85ab150b2d
      InstanceType: p3.2xlarge
    resources: {}
cluster_name: temp-aws
docker:
  container_name: ''
  image: ''
  pull_before_run: true
  run_options:
  - --ulimit nofile=65536:65536
  - -p 8008:8008
file_mounts: {}
head_node_type: ray.head.default
head_start_ray_commands:
- ray stop
- ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
idle_timeout_minutes: 5
initialization_commands: []
max_workers: 1
provider:
  cache_stopped_nodes: true
  region: us-east-2
  type: aws
rsync_exclude:
- '**/.git'
- '**/.git/**'
rsync_filter:
- .gitignore
setup_commands:
- pip3 install ray
- mkdir -p /tmp/workdir && cd /tmp/workdir && pip3 install --upgrade pip && pip3 install ray[default]
upscaling_speed: 1.0
worker_start_ray_commands:
- ray stop
- ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
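
For convenience, a rough driver loop for the repro steps above might look like this (sketch; the config path matches the command above, and the grep pattern for a ready worker is an assumption to adjust against your actual ray status output):

CONFIG=config/aws-distributed.yml   # path used in the repro command above

for i in 1 2 3 4 5; do
  ray up -y "$CONFIG" --no-config-cache
  # Poll ray status on the head node until the worker shows up.
  # The grep pattern is an assumption; adjust it to your ray status output.
  until ray exec "$CONFIG" "ray status" | grep -q "ray.worker.default"; do
    sleep 30
  done
done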

Anything else

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
michaelzhiluo added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage: priority, bug/not-bug, and owning component) labels on Nov 16, 2021
@DmitriGekhtman (Contributor) commented:

By "relaunches the worker", do you mean it restarts Ray on the worker?

Ray up restarts Ray across the cluster by default.
To avoid the restart, add the flag --no-restart.
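
For example, something along these lines (sketch; the config path is the one from the repro):

ray up -y config/aws-distributed.yml --no-restart --no-config-cache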

Let me know if that makes sense / solves the issue.

@michaelzhiluo (Contributor, Author) commented:

Thanks for the quick reply! By relaunching the worker, I mean that the Autoscaler stops the EC2 worker node and restarts it. The screenshot below shows what we are trying to avoid: when running ray up again, we want to prevent the worker from being relaunched from scratch.
[Screenshot: Screen Shot 2021-11-15 at 6.45.38 PM]

@DmitriGekhtman (Contributor) commented Nov 16, 2021

Got it.
Yeah, that's a bug. Could you post autoscaler logs after running ray up the second time? (ray monitor cluster.yaml, or the contents of /tmp/ray/session_latest/logs/monitor.*) Those should have some lines explaining why the worker was taken down.

@michaelzhiluo (Contributor, Author) commented:

2021-11-16 00:59:35,192 WARNING worker.py:1227 -- The actor or task with ID fd3463a596384e93cc2f4c914d291ea41a67faa92ae5ef73 cannot be scheduled right now. It requires {CPU: 1.000000}, {GPU: 1.000000}, {node:172.31.28.203: 1.000000} for placement, however the cluster currently cannot provide the requested resources. The required resources may be added as autoscaling takes place or placement groups are scheduled. Otherwise, consider reducing the resource requirements of the task.
(task-172.31.8.251 pid=24866) .
(task-172.31.8.251 pid=24866) ..
(autoscaler +20s) Tip: use `ray status` to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +20s) Restarting 1 nodes of type ray.worker.default (lost contact with raylet).
(raylet, ip=172.31.28.203) E1116 00:59:18.757578546   13713 server_chttp2.cc:49]        {"created":"@1637024358.757519244","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024358.757513873","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024358.757496636","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757490826","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024358.757512644","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024358.757509731","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 00:59:18,803 C 13713 13713] grpc_server.cc:82:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203)     ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203)     ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203)     ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203)     ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203)     ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203)     main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203)     boost::asio::io_context::run()
(raylet, ip=172.31.28.203)     main
(raylet, ip=172.31.28.203)     __libc_start_main
(raylet, ip=172.31.28.203)
(raylet, ip=172.31.28.203) E1116 01:00:01.855151317   32369 server_chttp2.cc:49]        {"created":"@1637024401.855092228","description":"No address added out of total 1 resolved","file":"external/com_github_grpc_grpc/src/core/ext/transport/chttp2/server/chttp2_server.cc","file_line":872,"referenced_errors":[{"created":"@1637024401.855086343","description":"Failed to add any wildcard listeners","file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_posix.cc","file_line":340,"referenced_errors":[{"created":"@1637024401.855067914","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855061141","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]},{"created":"@1637024401.855085081","description":"Unable to configure socket","fd":30,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":216,"referenced_errors":[{"created":"@1637024401.855082214","description":"Address already in use","errno":98,"file":"external/com_github_grpc_grpc/src/core/lib/iomgr/tcp_server_utils_posix_common.cc","file_line":190,"os_error":"Address already in use","syscall":"bind"}]}]}]}
(raylet, ip=172.31.28.203) [2021-11-16 01:00:01,896 C 32369 32369] grpc_server.cc:82:  Check failed: server_ Failed to start the grpc server. The specified port is 8076. This means that Ray's core components will not be able to function correctly. If the server startup error message is `Address already in use`, it indicates the server fails to start because the port is already used by other processes (such as --node-manager-port, --object-manager-port, --gcs-server-port, and ports between --min-worker-port, --max-worker-port). Try running lsof -i :8076 to check if there are other processes listening to the port.
(raylet, ip=172.31.28.203) *** StackTrace Information ***
(raylet, ip=172.31.28.203)     ray::SpdLogMessage::Flush()
(raylet, ip=172.31.28.203)     ray::RayLog::~RayLog()
(raylet, ip=172.31.28.203)     ray::rpc::GrpcServer::Run()
(raylet, ip=172.31.28.203)     ray::ObjectManager::ObjectManager()
(raylet, ip=172.31.28.203)     ray::raylet::NodeManager::NodeManager()
(raylet, ip=172.31.28.203)     ray::raylet::Raylet::Raylet()
(raylet, ip=172.31.28.203)     main::{lambda()#1}::operator()()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     ray::rpc::ClientCallImpl<>::OnReplyReceived()
(raylet, ip=172.31.28.203)     std::_Function_handler<>::_M_invoke()
(raylet, ip=172.31.28.203)     boost::asio::detail::completion_handler<>::do_complete()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::do_run_one()
(raylet, ip=172.31.28.203)     boost::asio::detail::scheduler::run()
(raylet, ip=172.31.28.203)     boost::asio::io_context::run()
(raylet, ip=172.31.28.203)     main
(raylet, ip=172.31.28.203)     __libc_start_main
(raylet, ip=172.31.28.203)
(autoscaler +1m4s) Removing 1 nodes of type ray.worker.default (launch failed).
(autoscaler +1m9s) Adding 1 nodes of type ray.worker.default.
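
As the raylet error above suggests, a quick check of what already holds port 8076 on the worker might look like this (sketch; the worker IP comes from the logs above, and the SSH key path is a placeholder):

# Inspect the worker from your workstation (key path is a placeholder).
ssh -i ~/.ssh/<your-key>.pem ubuntu@172.31.28.203 \
  'lsof -i :8076; pgrep -af raylet'   # what listens on 8076, and is an old raylet still alive?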

@DmitriGekhtman (Contributor) commented:

I had a typo in the path earlier: for the autoscaler logs, look at /tmp/ray/session_latest/logs/monitor.* on the head node.

Those look like driver logs (as opposed to autoscaler logs). They are helpful, though.

What we're seeing is that Ray on the worker is failing to get restarted, so the autoscaler freaks out and shuts the worker down before launching a new one to satisfy the min_workers constraint.

I think the logs for the thread that is supposed to restart Ray on the worker are in /tmp/ray/session_latest/logs/monitor.out.
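
Concretely, either of these should surface those autoscaler logs (sketch; the config path is the one used in the repro above):

ray monitor config/aws-distributed.yml   # tails the autoscaler logs from your workstation
# or, directly on the head node:
tail -n 200 /tmp/ray/session_latest/logs/monitor.out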

@DmitriGekhtman (Contributor) commented:

OK, I'm seeing the weirdness with the default example configs, too.

@DmitriGekhtman (Contributor) commented Nov 16, 2021

Ray start output when attempting to restart the worker's ray on the second ray up:

Local node IP: 10.0.1.18
[2021-11-15 23:05:38,600 I 224 224] global_state_accessor.cc:394: This node has an IP address of 10.0.1.18, while we can not found the matched Raylet address. This maybe come from when you connect the Ray cluster with a different IP address or connect a container.

@DmitriGekhtman (Contributor) commented Nov 16, 2021

@kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.

DmitriGekhtman added the P1 label (Issue that should be fixed within a few weeks) and removed the triage label on Nov 16, 2021
DmitriGekhtman added this to the Serverless Autoscaling milestone on Nov 16, 2021
@concretevitamin (Contributor) commented:

Thanks for the investigation @DmitriGekhtman. FYI, this showed up even when Docker is not used, e.g., with the docker: section removed from the YAML.

DmitriGekhtman changed the title from "[Bug] [Ray Autoscaler] Ray Worker Node Relaunching during 'ray up'" to "[Bug] [Ray Autoscaler] [Core] Ray Worker Node Relaunching during 'ray up'" on Nov 16, 2021
@kfstorm (Member) commented Nov 17, 2021

> @kfstorm @wuisawesome What does the error message in the last comment mean? I see it mentions containers -- we do have those here.

I'm not sure about this. It seems that the registered IP address of Raylet doesn't match the one detected by the driver. So the driver cannot find the local Raylet instance to connect to.

@ConeyLiu Any thoughts?

@DmitriGekhtman (Contributor) commented:

This looks pretty bad -- I'm seeing this in other contexts where we try to restart Ray on a node.

@concretevitamin (Contributor) commented:

Any update? We can work around this by delaying the second ray up as much as possible, but at some point it does need to be run again.

wuisawesome self-assigned this on Jan 4, 2022
@DmitriGekhtman (Contributor) commented:

Leaving this exclusively to @wuisawesome, since this issue appears to have a Ray-internal component, and that's a good enough reason to disqualify myself.

DmitriGekhtman removed their assignment on Feb 1, 2022
@michaelzhiluo (Contributor, Author) commented:

SGTM. We still encounter this issue pretty frequently, and it would be great if it were resolved soon.

@EricCousineau-TRI (Contributor) commented:

Possibly related to #19834?

@EricCousineau-TRI (Contributor) commented:

Yeah, fairly confident #19834 (comment) is related

Basically, yeah, restarting ray on workers makes the worker + head nodes sad.

Is this because the ray stop in worker_start_ray_commands may not always stop Ray correctly? Perhaps it leaves a lingering raylet?
https://docs.ray.io/en/releases-1.9.2/cluster/config.html#cluster-configuration-worker-start-ray-commands
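
One way to check that hypothesis (sketch; assumes you can SSH to the worker while the previous session is still up):

# Run these on the worker node:
ray stop            # the same command worker_start_ray_commands runs first
pgrep -af raylet    # if this still prints a raylet, ray stop did not fully stop it
lsof -i :8076       # shows whether the old object manager port is still bound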

@EricCousineau-TRI (Contributor) commented:

A dumb workaround is to issue an extra ray stop || true command before running ray up. It doesn't seem perfect, but it lowers the chance of running into this:
https://github.com/EricCousineau-TRI/repro/blob/b63b25f4683dd0afd7582748c2adfe7dc8aa0c6f/python/ray_example/run_all.sh#L20-L21

See the surrounding code and files for the full repro.
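
A rough version of that pre-stop workaround, without the helper scripts, might look like this (sketch; assumes ray get-worker-ips is available in your Ray version, that the nodes accept SSH as ubuntu, and that the key path is a placeholder):

CONFIG=config/aws-distributed.yml
# Stop Ray on the head node first (|| true so a dead Ray doesn't abort the script).
ray exec "$CONFIG" "ray stop || true"
# Then stop Ray on each worker before re-running ray up.
for ip in $(ray get-worker-ips "$CONFIG"); do
  ssh -i ~/.ssh/<your-key>.pem "ubuntu@$ip" "ray stop || true"
done
ray up -y "$CONFIG" --no-config-cache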

@EricCousineau-TRI (Contributor) commented:

Yeah, not sure if this is new info, but ray stop does not always seem to stop the server. I confirmed that in a setup I was just running, using my hacky ray_exec_all script. Output:
https://gist.github.com/EricCousineau-TRI/f2f67c488b75956bbb9d105cc4794ebc#file-ray-stop-failure-sh-L40-L58

Script: https://github.com/EricCousineau-TRI/repro/blob/b63b25f4683dd0afd7582748c2adfe7dc8aa0c6f/python/ray_example/ray_exec_all.py

AmeerHajAli added the infra label (autoscaler, ray client, kuberay, related issues) on Mar 26, 2022
hora-anyscale added the P2 label (Important issue, but not time-critical) and removed the P1 label (Issue that should be fixed within a few weeks) on Dec 19, 2022