Auto-scale ray clusters based on GCS load metrics #1348

Merged Dec 31, 2017 (36 commits)
Commits
a8eb626  wip gc metrics autoscaling  (ericl, Dec 17, 2017)
4c09b68  add load metrics debug string  (ericl, Dec 17, 2017)
c6a9c2c  add ray ip  (ericl, Dec 17, 2017)
cc1d722  wire it end to end  (ericl, Dec 17, 2017)
6732223  wip dev  (ericl, Dec 17, 2017)
419ca59  fix bug  (ericl, Dec 17, 2017)
a8240b8  update  (ericl, Dec 17, 2017)
c49a35a  Sun Dec 17 14:19:03 PST 2017  (ericl, Dec 17, 2017)
9a69baf  wip  (ericl, Dec 17, 2017)
03f47de  wip  (ericl, Dec 19, 2017)
97dfa7f  add update throttling; reorg init commands  (ericl, Dec 21, 2017)
3ccb5fc  Wed Dec 20 23:44:56 PST 2017  (ericl, Dec 21, 2017)
d805f90  Wed Dec 20 23:51:37 PST 2017  (ericl, Dec 21, 2017)
13436dd  Wed Dec 20 23:58:18 PST 2017  (ericl, Dec 21, 2017)
f6d96d1  Thu Dec 21 00:03:19 PST 2017  (ericl, Dec 21, 2017)
7fe7ec7  Thu Dec 21 00:04:33 PST 2017  (ericl, Dec 21, 2017)
60a487b  Thu Dec 21 00:05:32 PST 2017  (ericl, Dec 21, 2017)
68df74a  Thu Dec 21 00:15:11 PST 2017  (ericl, Dec 21, 2017)
ead1984  Thu Dec 21 00:30:54 PST 2017  (ericl, Dec 21, 2017)
49433bf  Thu Dec 21 00:39:34 PST 2017  (ericl, Dec 21, 2017)
ef435c9  Thu Dec 21 01:16:11 PST 2017  (ericl, Dec 21, 2017)
c2d9efc  Thu Dec 21 01:25:24 PST 2017  (ericl, Dec 21, 2017)
2f977a4  update  (ericl, Dec 21, 2017)
b816480  update example  (ericl, Dec 21, 2017)
4d4d9d1  Merge remote-tracking branch 'upstream/master' into load-metrics  (ericl, Dec 25, 2017)
006ef46  fix tests  (ericl, Dec 26, 2017)
f40253d  unit tests  (ericl, Dec 26, 2017)
40ddf3c  Mon Dec 25 16:39:56 PST 2017  (ericl, Dec 26, 2017)
c88df7d  fi xlint  (ericl, Dec 26, 2017)
d9e7df6  fix np ceil  (ericl, Dec 27, 2017)
4280e53  Fix path for development-example.yaml  (robertnishihara, Dec 28, 2017)
1969bcf  Remove unnecessary line.  (robertnishihara, Dec 29, 2017)
bef8e29  fix idempotent  (ericl, Dec 30, 2017)
4093407  Update autoscaler.py  (ericl, Dec 30, 2017)
b5df753  fix nondterministic test  (ericl, Dec 30, 2017)
3237986  Update autoscaler.py  (ericl, Dec 31, 2017)
Update autoscaler.py
ericl authored Dec 30, 2017
commit 4093407918977136eb54c6cf025ddc7185c9b45c
23 changes: 4 additions & 19 deletions python/ray/autoscaler/autoscaler.py
@@ -84,21 +84,6 @@
 }
 
 
-# Abort autoscaling if more than this number of errors are encountered. This
-# is a safety feature to prevent e.g. runaway node launches.
-MAX_NUM_FAILURES = 5
-
-# Max number of nodes to launch at a time.
-MAX_CONCURRENT_LAUNCHES = 10
-
-# Interval at which to perform autoscaling updates.
-UPDATE_INTERVAL_S = 5
-
-# The autoscaler will attempt to restart Ray on nodes it hasn't heard from
-# in more than this interval.
-HEARTBEAT_TIMEOUT_S = 30
-
-
 class LoadMetrics(object):
     """Container for cluster load metrics.
 
@@ -207,10 +192,10 @@ class StandardAutoscaler(object):
 
     def __init__(
             self, config_path, load_metrics,
-            max_concurrent_launches=MAX_CONCURRENT_LAUNCHES,
-            max_failures=MAX_NUM_FAILURES, process_runner=subprocess,
+            max_concurrent_launches=AUTOSCALER_MAX_CONCURRENT_LAUNCHES,
+            max_failures=AUTOSCALER_MAX_NUM_FAILURES, process_runner=subprocess,
             verbose_updates=False, node_updater_cls=NodeUpdaterProcess,
-            update_interval_s=UPDATE_INTERVAL_S):
+            update_interval_s=AUTOSCALER_UPDATE_INTERVAL_S):
         self.config_path = config_path
         self.reload_config(errors_fatal=True)
         self.load_metrics = load_metrics
@@ -382,7 +367,7 @@ def recover_if_needed(self, node_id):
             return
         last_heartbeat_time = self.load_metrics.last_heartbeat_time_by_ip.get(
             self.provider.internal_ip(node_id), 0)
-        if time.time() - last_heartbeat_time < HEARTBEAT_TIMEOUT_S:
+        if time.time() - last_heartbeat_time < AUTOSCALER_HEARTBEAT_TIMEOUT_S:
             return
         print("StandardAutoscaler: Restarting Ray on {}".format(node_id))
         updater = self.node_updater_cls(
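
For readers skimming the `recover_if_needed` hunk, the heartbeat check can be isolated into a small standalone sketch. The function name `heartbeat_is_stale` and its `now` parameter are hypothetical and not part of the diff. The value 30 is taken from the removed `HEARTBEAT_TIMEOUT_S` line, on the assumption that `AUTOSCALER_HEARTBEAT_TIMEOUT_S` keeps the same value. The diff does not show where the renamed constant is defined.

```python
import time

# Assumed to carry over the value of the removed HEARTBEAT_TIMEOUT_S.
AUTOSCALER_HEARTBEAT_TIMEOUT_S = 30


def heartbeat_is_stale(last_heartbeat_time_by_ip, ip, now=None,
                       timeout_s=AUTOSCALER_HEARTBEAT_TIMEOUT_S):
    """Return True when a restart would be attempted for this IP.

    Hypothetical helper mirroring the comparison in the diff: the node is
    left alone while `now - last_heartbeat < timeout_s`.
    """
    if now is None:
        now = time.time()
    # As in the diff, a missing entry defaults to 0, so a node that has
    # never sent a heartbeat looks maximally stale.
    last_heartbeat_time = last_heartbeat_time_by_ip.get(ip, 0)
    return now - last_heartbeat_time >= timeout_s


heartbeats = {"10.0.0.1": 1000.0}
print(heartbeat_is_stale(heartbeats, "10.0.0.1", now=1010.0))
print(heartbeat_is_stale(heartbeats, "10.0.0.1", now=1030.0))
print(heartbeat_is_stale(heartbeats, "10.0.0.2", now=1010.0))
```

One consequence of the `.get(..., 0)` default is that a node with no recorded heartbeat is always treated as overdue, which may or may not be the intended behavior for freshly launched nodes.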
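
The comments on the removed constants describe three limits that the constructor accepts: a failure budget, a cap on launches per round, and a minimum interval between updates. The sketch below shows one way such limits could interact. `ThrottledScaler`, its methods, and the injectable `clock` are hypothetical and are not the `StandardAutoscaler` implementation. The default values are copied from the removed lines, on the assumption that the `AUTOSCALER_*` constants keep them.

```python
import time

# Values copied from the constants removed in this commit (assumed unchanged).
AUTOSCALER_MAX_NUM_FAILURES = 5
AUTOSCALER_MAX_CONCURRENT_LAUNCHES = 10
AUTOSCALER_UPDATE_INTERVAL_S = 5


class ThrottledScaler:
    """Hypothetical illustration of the three limits; not Ray's code."""

    def __init__(self,
                 max_concurrent_launches=AUTOSCALER_MAX_CONCURRENT_LAUNCHES,
                 max_failures=AUTOSCALER_MAX_NUM_FAILURES,
                 update_interval_s=AUTOSCALER_UPDATE_INTERVAL_S,
                 clock=time.time):
        self.max_concurrent_launches = max_concurrent_launches
        self.max_failures = max_failures
        self.update_interval_s = update_interval_s
        self.clock = clock
        self.last_update_time = None
        self.num_failures = 0

    def update(self, step):
        """Run `step` unless throttled; return True if it was attempted."""
        now = self.clock()
        if (self.last_update_time is not None
                and now - self.last_update_time < self.update_interval_s):
            return False  # too soon since the last update
        self.last_update_time = now
        try:
            step()
        except Exception:
            self.num_failures += 1
            # Abort once more than max_failures errors have been seen.
            if self.num_failures > self.max_failures:
                raise
        return True

    def num_to_launch(self, num_missing):
        """Cap the number of nodes launched in one round."""
        return max(0, min(self.max_concurrent_launches, num_missing))
```

Passing the clock in as a parameter is a testing convenience in this sketch: the throttle can then be exercised without sleeping.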