
Auto-scale ray clusters based on GCS load metrics #1348

Merged: 36 commits merged into ray-project:master on Dec 31, 2017

Conversation

@ericl (Contributor) commented Dec 19, 2017

What do these changes do?

This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:

  • Based on current (instantaneous) load information, we compute the approximate number of "used workers". This is based on the bottleneck resource, e.g. if 8/8 GPUs are used in an 8-node cluster but all the CPUs are idle, the number of used nodes is still counted as 8. This number can also be fractional.
  • We scale that number by 1 / target_utilization_fraction and round up to determine the target cluster size (subject to the max_workers constraint); a rough sketch of this calculation follows the list. The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
  • When a node is idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.
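
A rough sketch of the target-size calculation described above, using illustrative names rather than the PR's actual code:

import math

def target_cluster_size(used_resources, resources_per_node,
                        target_utilization_fraction,
                        min_workers, max_workers):
    """Illustrative sketch of the scaling rule; names are hypothetical."""
    # "Used workers" is driven by the bottleneck resource: 8/8 GPUs busy in an
    # 8-node cluster counts as 8 used nodes even if every CPU is idle. The
    # value can be fractional.
    used_workers = max(
        used_resources.get(resource, 0.0) / amount_per_node
        for resource, amount_per_node in resources_per_node.items())

    # Scale by 1 / target_utilization_fraction, round up, and clamp to the
    # configured worker bounds; the control loop then launches or removes
    # nodes until the cluster matches this target.
    target = int(math.ceil(used_workers / target_utilization_fraction))
    return min(max(target, min_workers), max_workers)

# Example: 8 GPUs busy, 1 GPU and 4 CPUs per node, 50% target utilization.
print(target_cluster_size({"GPU": 8.0, "CPU": 3.0},
                          {"GPU": 1.0, "CPU": 4.0},
                          target_utilization_fraction=0.5,
                          min_workers=0, max_workers=20))  # prints 16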

Note that we'll need to update the wheel in the example yaml file after this PR is merged.

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2849/

@robertnishihara (Collaborator) commented Dec 20, 2017

Just tried this with the default config

ray create_or_update ray/python/ray/autoscaler/aws/example.yaml

It successfully started a node, but then it printed

NodeUpdater: Applied config a6565145497ee0bcb66896453b9a61a488f47020 to node i-0d4016ff8eecf92ff
Head node up-to-date, IP address is: None
To monitor auto-scaling activity, you can run:

  ssh -i /Users/rkn/.ssh/ray-autoscaler_1.pem ubuntu@None 'tail -f /tmp/raylogs/monitor-*'

for some reason the IP address is None.

@@ -64,6 +74,93 @@
# Max number of nodes to launch at a time.
MAX_CONCURRENT_LAUNCHES = 10

# Print debug string once every this many seconds
DEBUG_INTERVAL_S = 5
Collaborator:

These constants MAX_NUM_FAILURES, MAX_CONCURRENT_LAUNCHES, and DEBUG_INTERVAL_S should go in https://github.com/ray-project/ray/blob/master/src/common/state/ray_config.h. Keeping all constants there simplifies debugging (these kinds of constants often need to be changed during debugging), so you don't have to track them down throughout the whole codebase (they are not considered exposed to users).

If you want I can make this change.

@ericl (Contributor, Author) replied Dec 21, 2017:

How do you access them from Python? I'd prefer to keep Python constants in .py files, since it avoids recompiles.

Collaborator:

You can see how it's done in #1192; it actually takes a number of files to expose a constant to Python.

Good point about recompilation. Maybe we should have two files: one for C constants and one for Python constants. What do you think about creating ray/python/ray/constants.py?
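
For illustration, a Python-side constants module along those lines might look roughly like this (a sketch of the idea under discussion; the AUTOSCALER_ prefix and the failure-count value are assumptions, not the contents of any file actually added in the PR):

# ray/python/ray/constants.py (hypothetical sketch)

# Max number of failed updates the autoscaler tolerates before giving up
# on a node (illustrative value).
AUTOSCALER_MAX_NUM_FAILURES = 5

# Max number of nodes to launch at a time.
AUTOSCALER_MAX_CONCURRENT_LAUNCHES = 10

# Print the autoscaler debug string once every this many seconds.
AUTOSCALER_DEBUG_INTERVAL_S = 5

# Restart Ray on nodes the autoscaler hasn't heard from in this many seconds.
AUTOSCALER_HEARTBEAT_TIMEOUT_S = 30

Call sites would then import from this module (e.g. from ray.constants import AUTOSCALER_MAX_CONCURRENT_LAUNCHES) instead of defining the value locally.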

@robertnishihara (Collaborator) commented:

I'm getting timeout errors on the current master

$ ray create_or_update ~/Workspace/ray/python/ray/autoscaler/aws/example.yaml 
Role not specified for head node, using arn:aws:iam::339530224232:instance-profile/ray-autoscaler
KeyName not specified for nodes, using ray-autoscaler_1
SubnetId not specified for head node, using subnet-8e489ed7 in us-west-2c
SubnetId not specified for workers, using subnet-8e489ed7 in us-west-2c
SecurityGroupIds not specified for head node, using ray-autoscaler-default
SecurityGroupIds not specified for workers, using ray-autoscaler-default
Launching new head node...
Updating files on head node...
NodeUpdater: Updating i-01596b463d5700cc1 to 8f0512ae89aaeb7304fb593120a29f313cce7691, logging to (console)
NodeUpdater: Waiting for IP of i-01596b463d5700cc1...
NodeUpdater: Waiting for SSH to i-01596b463d5700cc1...
NodeUpdater: SSH not up, retrying: 
NodeUpdater: Waiting for SSH to i-01596b463d5700cc1...
NodeUpdater: SSH not up, retrying: 
NodeUpdater: Waiting for SSH to i-01596b463d5700cc1...
NodeUpdater: SSH not up, retrying: Command '['ssh', '-o', 'ConnectTimeout=5s', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/rkn/.ssh/ray-autoscaler_1.pem', 'ubuntu@35.162.178.19', 'uptime']' returned non-zero exit status 255.
NodeUpdater: Waiting for SSH to i-01596b463d5700cc1...
NodeUpdater: SSH not up, retrying: Command '['ssh', '-o', 'ConnectTimeout=5s', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/rkn/.ssh/ray-autoscaler_1.pem', 'ubuntu@35.162.178.19', 'uptime']' returned non-zero exit status 255.
[the previous two lines repeat 27 more times]
NodeUpdater: Syncing /Users/rkn/.ssh/ray-autoscaler_1.pem to ~/ray_bootstrap_key.pem...
ssh: connect to host 35.162.178.19 port 22: Operation timed out
NodeUpdater: Error updating Command '['ssh', '-o', 'ConnectTimeout=60s', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/rkn/.ssh/ray-autoscaler_1.pem', 'ubuntu@35.162.178.19', 'mkdir -p ~']' returned non-zero exit status 255., see (console) for remote logs
Process NodeUpdaterProcess-1:
Traceback (most recent call last):
  File "/Users/rkn/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/rkn/Workspace/ray/python/ray/autoscaler/updater.py", line 67, in run
    raise e
  File "/Users/rkn/Workspace/ray/python/ray/autoscaler/updater.py", line 54, in run
    self.do_update()
  File "/Users/rkn/Workspace/ray/python/ray/autoscaler/updater.py", line 132, in do_update
    "mkdir -p {}".format(os.path.dirname(remote_path)))
  File "/Users/rkn/Workspace/ray/python/ray/autoscaler/updater.py", line 158, in ssh_cmd
    ], stdout=redirect or self.stdout, stderr=redirect or self.stderr)
  File "/Users/rkn/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-o', 'ConnectTimeout=60s', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/rkn/.ssh/ray-autoscaler_1.pem', 'ubuntu@35.162.178.19', 'mkdir -p ~']' returned non-zero exit status 255.
Error: updating 35.162.178.19 failed

It seems a bit opaque to me. Any idea about this?

@ericl (Contributor, Author) commented Dec 21, 2017

Hmm, when you run the SSH command manually does it work? It looks like it can't reach the instance for some reason.

@robertnishihara (Collaborator) commented:

@ericl I just tried again and this time it succeeded. I'll let you know if I can reproduce the original problem.

@@ -4,8 +4,10 @@

import json
import hashlib
import math
Collaborator:

shall we just use numpy everywhere?

@ericl force-pushed the load-metrics branch 2 times, most recently from fd9e762 to 6172f78 (December 21, 2017 08:41)
@robertnishihara (Collaborator) commented Dec 28, 2017

Oh, I just noticed you removed the line

echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc

adding that back in along with

source ~/.bashrc

should fix it. I'm trying this out and will push the diff if it works
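
For context, the relevant setup lines in the example YAML would look roughly like this (a sketch assuming a setup_commands list like the one in that file; not necessarily the exact diff that gets pushed):

setup_commands:
  # The Anaconda installer only offers to do this when run interactively,
  # so add anaconda3 to PATH explicitly and reload the shell config.
  - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
  - source ~/.bashrc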

@ericl (Contributor, Author) commented Dec 29, 2017

Oh, I see. I removed that since I thought it was unneeded. Where does that PATH entry come from on login, if not .bashrc?

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3001/

@robertnishihara (Collaborator) commented:

I see, I'll remove the source thing then. The line echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc is usually run by the Anaconda installation script (when run interactively, it prompts you and asks whether you want to add it to your PATH). When the installer runs non-interactively, that step just doesn't happen.

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3003/

# Install basics.
- sudo apt-get update
- sudo apt-get install -y cmake pkg-config build-essential autoconf curl libtool unzip python
# TODO(ekl): are these commands idempotent?
Collaborator:

they are definitely not idempotent

Contributor (Author):

I added some || trues which should fix it for now.
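
A hypothetical illustration of that pattern (the specific commands and their order in the PR may differ):

setup_commands:
  # apt-get update/install can be re-run safely, but a clone fails if the
  # checkout already exists, so "|| true" lets updates proceed on nodes that
  # were already provisioned.
  - sudo apt-get update
  - sudo apt-get install -y cmake pkg-config build-essential autoconf curl libtool unzip python
  - git clone https://github.com/ray-project/ray.git || true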

@robertnishihara (Collaborator) commented Dec 29, 2017

@ericl Just ran out of disk space on the cluster while testing this. Is there a way to increase disk space through the yaml file?

@@ -850,6 +850,7 @@ def start_worker(node_ip_address, object_store_name, object_store_manager_name,
default.
"""
command = [sys.executable,
"-u",
Collaborator:

great idea :)
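
For readers unfamiliar with the flag: python -u disables stdout/stderr buffering, so worker output reaches the log files immediately rather than sitting in a pipe buffer. A minimal, self-contained illustration (not Ray code):

import subprocess
import sys

# Start a child interpreter with "-u", mirroring the worker command in the
# diff above. The print flushes immediately, so readline() returns right away;
# without "-u", a piped child block-buffers and the line would only arrive
# once the child exits or fills its buffer.
child = subprocess.Popen(
    [sys.executable, "-u", "-c",
     "import time; print('worker alive'); time.sleep(5)"],
    stdout=subprocess.PIPE)
print(child.stdout.readline())  # b'worker alive\n', printed without waiting 5s
child.wait()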


# The autoscaler will attempt to restart Ray on nodes it hasn't heard from
# in more than this interval.
HEARTBEAT_TIMEOUT_S = 30
Collaborator:

I still think it'd be a good idea to have a separate file with all of the constants. That way people don't have to search through all the files to find all the constants they need to change while debugging. What do you think about putting these in a separate ray_constants.py file?

Contributor (Author):

Done

@@ -123,43 +247,59 @@ def update(self):
raise e

def _update(self):
# Throttle autoscaling updates to this interval to avoid exceeding
# rate limits on API calls.
if time.time() - self.last_update_time < self.update_interval_s:
Collaborator:

throttling here is a great idea
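
A minimal sketch of the throttling pattern shown in the hunk above (class and method names are illustrative, not the autoscaler's actual ones):

import time

class ThrottledUpdater:
    def __init__(self, update_interval_s=5.0):
        self.update_interval_s = update_interval_s
        self.last_update_time = 0.0

    def update(self):
        # Skip this tick if the last update was too recent, so the control
        # loop never exceeds the cloud provider's API rate limits.
        if time.time() - self.last_update_time < self.update_interval_s:
            return
        self.last_update_time = time.time()
        self._update()

    def _update(self):
        print("querying node provider and reconciling cluster size")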

@ericl (Contributor, Author) commented Dec 29, 2017

Hmm, I think you can attach EBS volumes or something using the node config, but I'm not sure exactly how. It should be in the linked API docs.
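
For reference, one way this might look (an assumption based on the EC2 run_instances parameters, not something verified in this PR) is to add a BlockDeviceMappings entry to the worker node configuration in the YAML; the worker_nodes key here is assumed to match the example config:

worker_nodes:
  InstanceType: m4.4xlarge
  # Root volume size is controlled through the EC2 block device mapping;
  # the device name depends on the AMI (for the Ubuntu AMIs it is /dev/sda1).
  BlockDeviceMappings:
    - DeviceName: /dev/sda1
      Ebs:
        VolumeSize: 100   # GiB
        VolumeType: gp2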

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3014/

@robertnishihara (Collaborator) left a review:

Looks good to me. Note there seem to be some linting errors on Travis.

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3021/

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3025/

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/3041/

@ericl merged commit b6c42f9 into ray-project:master on Dec 31, 2017
@robertnishihara deleted the load-metrics branch on December 31, 2017 at 22:45