Auto-scale ray clusters based on GCS load metrics #1348
Conversation
Just tried this with the default config.
It successfully started a node, but then it printed
for some reason the IP address is
python/ray/autoscaler/autoscaler.py (outdated)
@@ -64,6 +74,93 @@
# Max number of nodes to launch at a time.
MAX_CONCURRENT_LAUNCHES = 10

# Print debug string once every this many seconds.
DEBUG_INTERVAL_S = 5
These constants MAX_NUM_FAILURES, MAX_CONCURRENT_LAUNCHES, and DEBUG_INTERVAL_S should go in https://github.com/ray-project/ray/blob/master/src/common/state/ray_config.h. All constants are there to simplify debugging (these kinds of constants often need to be changed during debugging), so that you don't have to track down constants throughout the whole codebase (these are not considered exposed to users).
If you want I can make this change.
How do you access them from Python? I'd prefer to keep Python constants in .py files, since it avoids recompiles.
You can see how it's done in #1192; it actually involves a number of files to expose it to Python.
Good point about recompilation. Maybe we should have two files: one for C constants and one for Python constants. What do you think about creating ray/python/ray/constants.py?
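For illustration, a shared Python constants module along these lines might look like the sketch below. The file path and the MAX_NUM_FAILURES value are assumptions; MAX_CONCURRENT_LAUNCHES and DEBUG_INTERVAL_S use the values shown in the diff above.

```python
# python/ray/ray_constants.py -- hypothetical sketch of a central constants file.
# Keeping tunable Python-side constants in one module makes them easy to find
# and adjust while debugging, without requiring any C++ recompilation.

# Number of consecutive autoscaler update failures tolerated before giving up
# (value here is illustrative).
MAX_NUM_FAILURES = 5

# Max number of nodes to launch at a time.
MAX_CONCURRENT_LAUNCHES = 10

# Print debug string once every this many seconds.
DEBUG_INTERVAL_S = 5
```

Callers would then import this module (for example `from ray import ray_constants`) and reference `ray_constants.MAX_CONCURRENT_LAUNCHES` instead of defining module-local constants.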
I'm getting timeout errors on the current master. It seems a bit opaque to me. Any idea about this?
Hmm, when you run the SSH command manually does it work? It looks like it can't reach the instance for some reason.
@ericl I just tried again and this time it succeeded. I'll let you know if I can reproduce the original problem.
python/ray/autoscaler/autoscaler.py (outdated)
@@ -4,8 +4,10 @@

import json
import hashlib
import math
shall we just use numpy everywhere?
Oh, I just noticed you removed the line
adding that back in along with
should fix it. I'm trying this out and will push the diff if it works.
Oh I see, I removed that since I thought it was unneeded. Where does that path entry come from on login if not bashrc?
I see, I'll remove the source thing then. The line
# Install basics.
- sudo apt-get update
- sudo apt-get install -y cmake pkg-config build-essential autoconf curl libtool unzip python
# TODO(ekl): are these commands idempotent?
they are definitely not idempotent
I added some || trues which should fix it for now.
@ericl Just ran out of disk space on the cluster while testing this. Is there a way to increase disk space through the yaml file?
@@ -850,6 +850,7 @@ def start_worker(node_ip_address, object_store_name, object_store_manager_name,
        default.
    """
    command = [sys.executable,
               "-u",
great idea :)
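For context, the `-u` flag runs Python unbuffered, so a worker's stdout/stderr reaches its log file or pipe immediately instead of sitting in an output buffer. A small self-contained sketch of the effect (the command below is illustrative, not the actual worker invocation):

```python
import subprocess
import sys

# "-u" forces unbuffered stdout/stderr, so lines printed by the child process
# show up immediately even when its output is redirected to a pipe or log file.
command = [sys.executable, "-u", "-c", "print('hello from an unbuffered worker')"]
subprocess.check_call(command)
```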
python/ray/autoscaler/autoscaler.py (outdated)

# The autoscaler will attempt to restart Ray on nodes it hasn't heard from
# in more than this interval.
HEARTBEAT_TIMEOUT_S = 30
I still think it'd be a good idea to have a separate file with all of the constants. That way people don't have to search through all the files to find all the constants they need to change while debugging. What do you think about putting these in a separate ray_constants.py file?
Done
@@ -123,43 +247,59 @@ def update(self):
            raise e

    def _update(self):
        # Throttle autoscaling updates to this interval to avoid exceeding
        # rate limits on API calls.
        if time.time() - self.last_update_time < self.update_interval_s:
throttling here is a great idea
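As a rough sketch of the throttling pattern shown in the diff (the class below is illustrative, not the actual autoscaler; only the attribute names mirror the diff):

```python
import time


class ThrottledUpdater:
    """Illustrative sketch: skip update passes that arrive too soon after the
    previous one, to stay under cloud provider API rate limits."""

    def __init__(self, update_interval_s=5.0):
        self.update_interval_s = update_interval_s
        self.last_update_time = 0.0

    def update(self):
        # Bail out if the last update happened within the throttle interval.
        if time.time() - self.last_update_time < self.update_interval_s:
            return
        self.last_update_time = time.time()
        self._run_update_pass()

    def _run_update_pass(self):
        print("running one autoscaler update pass")


updater = ThrottledUpdater(update_interval_s=5.0)
updater.update()  # runs
updater.update()  # skipped: called again within 5 seconds
```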
Hmm, I think you can attach EBS or something using the node config, but I'm not sure how exactly. It should be in the linked API docs.
Looks good to me. Note there seem to be some linting errors on Travis.
What do these changes do?
This adds (experimental) auto-scaling support for Ray clusters based on GCS load metrics. The auto-scaling algorithm is as follows:
- We take the current cluster load (from the GCS load metrics), multiply it by 1 / target_utilization_fraction, and round up to determine the target cluster size (subject to the max_workers constraint). The autoscaler control loop takes care of launching new nodes until the target cluster size is met.
- If a node has been idle for more than idle_timeout_minutes, we remove it from the cluster if that would not drop the cluster size below min_workers.

Note that we'll need to update the wheel in the example yaml file after this PR is merged.
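A rough sketch of the scaling rule described above, assuming load is measured as the number of workers currently in use (the function names and the example numbers are illustrative, not the actual autoscaler code):

```python
import math


def target_cluster_size(num_used_workers, target_utilization_fraction, max_workers):
    """Multiply the observed load by 1 / target_utilization_fraction, round up,
    and clamp to the max_workers constraint."""
    target = int(math.ceil(num_used_workers / target_utilization_fraction))
    return min(target, max_workers)


def should_remove_idle_node(idle_minutes, idle_timeout_minutes, cluster_size, min_workers):
    """Remove an idle node only if it exceeded the idle timeout and removing it
    would not drop the cluster below min_workers."""
    return idle_minutes > idle_timeout_minutes and cluster_size - 1 >= min_workers


# Example: 8 busy workers at a 0.8 target utilization -> target of 10 nodes.
print(target_cluster_size(8, 0.8, max_workers=20))                    # 10
print(should_remove_idle_node(10, 5, cluster_size=4, min_workers=3))  # True
```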