EC2 cluster setup scripts and initial version of auto-scaler #1311

Merged (81 commits) on Dec 16, 2017

Conversation

@ericl (Contributor, Author) commented Dec 11, 2017

What do these changes do?

This adds an autoscaler component to Ray, which can autonomously (eventually) add and remove worker nodes. This also simplifies cluster setup -- all we need to do is bootstrap a head node that in turn can set up the worker nodes.
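
At a high level, the autoscaler behaves like a periodic reconciliation loop running on the head node. The sketch below is only a hedged illustration of that idea; the class name ToyAutoscaler and the provider methods (nodes, create_node, terminate_node) are assumptions made for the example, not the exact API introduced by this PR.

# Hedged sketch: the reconciliation idea behind an autoscaler.
# Names here are illustrative, not the PR's actual classes.
import time

class ToyAutoscaler(object):
    def __init__(self, provider, min_workers, max_workers):
        self.provider = provider          # launches/terminates cloud nodes
        self.min_workers = min_workers
        self.max_workers = max_workers

    def update(self):
        workers = self.provider.nodes()   # currently running worker nodes
        if len(workers) < self.min_workers:
            self.provider.create_node()                # scale up
        elif len(workers) > self.max_workers:
            self.provider.terminate_node(workers[-1])  # scale down

def run_forever(autoscaler, interval_s=5):
    # The head node would run a loop like this in the background.
    while True:
        autoscaler.update()
        time.sleep(interval_s)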

@richardliaw (Contributor) left a comment:

Slick stuff. Some comments below, and I will try out the spot instance functionality.

import time
import sys

import yaml
Contributor: Is this an (official) dependency now?

ericl (Author): How do I make it one?

Contributor: I think add it to requirements.txt and all the test setup scripts?

@click.option(
    "--max-workers", required=False, type=int, help=(
        "Override the configured max worker count for the cluster."))
def create_or_update(
Contributor: Not a fan of the naming, but this is just aesthetics.

Collaborator: What would you call it? E.g.,

ray start_cluster
ray terminate_cluster

Collaborator: Though I guess start_cluster doesn't capture the possibility of updating an existing cluster.

Contributor: One idea is ray cluster --start and ray cluster --update (similar to ray start --head and ray start --redis-address ...).

ericl (Author): ray create_or_update_cluster <>....

"Override the configured min worker count for the cluster."))
@click.option(
"--max-workers", required=False, type=int, help=(
"Override the configured max worker count for the cluster."))
Contributor: "max worker node count"? (And "min worker node count" in the docs above; this can be confused with Ray workers.)

ericl (Author): Done.
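
For context, here is a minimal, self-contained sketch of click options using the clarified "worker node count" wording; the command body is an illustrative stub, not the PR's actual implementation.

# Hedged sketch: illustrative click command, not the PR's exact CLI.
import click

@click.command()
@click.option(
    "--min-workers", required=False, type=int,
    help="Override the configured min worker node count for the cluster.")
@click.option(
    "--max-workers", required=False, type=int,
    help="Override the configured max worker node count for the cluster.")
def create_or_update(min_workers, max_workers):
    """Create or update a Ray cluster (stub for illustration)."""
    click.echo("min={}, max={}".format(min_workers, max_workers))

if __name__ == "__main__":
    create_or_update()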

self.runtime_hash, self.node_id),
file=self.stdout)

def do_update(self):
Contributor: Do you mind adding some comments for this?

ericl (Author): Done.
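
As a hedged illustration of the kind of commentary being asked for, the sketch below shows the typical shape of a node update step; the helper names (wait_for_ssh, sync_files, run_setup_commands, mark_up_to_date) are hypothetical stand-ins, not the methods in this diff.

# Hedged sketch of a node update step, with explanatory comments.
# Helper names are hypothetical.
def do_update(self):
    # Wait until the node is reachable over SSH before doing anything.
    self.wait_for_ssh()

    # Copy the configured file mounts (e.g. the cluster config and the
    # bootstrap SSH key) onto the node.
    self.sync_files()

    # Run the user-specified setup/init commands on the node.
    self.run_setup_commands()

    # Record the hash of the applied config so the autoscaler can tell
    # up-to-date nodes apart from out-of-date ones.
    self.mark_up_to_date(self.runtime_hash)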

    print(self.debug_string())
    return
else:
    # If enough nodes, terminate an out-of-date node.
Contributor: So you kill nodes first, then you start nodes, then you kill out-of-date nodes. Doesn't this result in fewer nodes than the target?

Contributor: I guess this will be addressed in the next loop, but it's a little weird being out of sync.

ericl (Author): You're right, it's kind of an arbitrary order. I'm not sure there is a right answer here.
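
To make the ordering being discussed concrete, here is a hedged sketch of one pass of such an update loop, under the assumption (not taken verbatim from the diff) that it first trims excess nodes, then launches up to the target, and only then recycles a single out-of-date node; the temporary undershoot is corrected on the next pass.

# Hedged sketch of one reconciliation pass; names are illustrative.
def update_once(provider, target, up_to_date):
    nodes = provider.workers()

    # 1. Terminate nodes beyond the target count.
    while len(nodes) > target:
        provider.terminate_node(nodes.pop())

    # 2. Launch nodes until the target count is reached.
    while len(nodes) < target:
        nodes.append(provider.create_node())

    # 3. At the target, recycle at most one out-of-date node per pass.
    #    This briefly drops below target; the next pass relaunches it.
    stale = [n for n in nodes if not up_to_date(n)]
    if stale:
        provider.terminate_node(stale[-1])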

NodeUpdater.__init__(self, *args, **kwargs)


class NodeUpdaterThread(NodeUpdater, Thread):
Contributor: Is this used anywhere?

ericl (Author): Yeah, it's used for the unit test, to allow the provider to be mocked in a single process.
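
A minimal sketch of the mix-in pattern being described, assuming the updater exposes a run() method that performs the update; the MockProvider and the mark_updated call are purely illustrative.

# Hedged sketch: combining an updater with threading.Thread so unit
# tests can drive updates in-process against a mocked provider.
from threading import Thread

class NodeUpdater(object):
    def __init__(self, node_id, provider):
        self.node_id = node_id
        self.provider = provider

    def run(self):
        # The real updater would SSH to the node and apply updates;
        # this stand-in just records the call on the (mock) provider.
        self.provider.mark_updated(self.node_id)

class NodeUpdaterThread(NodeUpdater, Thread):
    def __init__(self, *args, **kwargs):
        Thread.__init__(self)
        NodeUpdater.__init__(self, *args, **kwargs)
        # start() invokes self.run(), which resolves to NodeUpdater.run()
        # via the MRO, so the update logic runs on a background thread.

class MockProvider(object):
    def __init__(self):
        self.updated = []

    def mark_updated(self, node_id):
        self.updated.append(node_id)

provider = MockProvider()
updater = NodeUpdaterThread("node-1", provider)
updater.start()
updater.join()
assert provider.updated == ["node-1"]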

del self.updaters[node_id]
print(self.debug_string())

def reload_config(self, errors_fatal):
@richardliaw (Contributor), Dec 15, 2017: Everywhere else, errors_fatal is passed as a key=value (keyword) argument; update the signature to match?

ericl (Author): Done.
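
For illustration, a hedged sketch of the keyword-style signature the comment asks for (a default value lets call sites pass errors_fatal by name); this is not necessarily the diff's final form.

class StandardAutoscaler(object):
    # Hedged sketch: default value so callers pass errors_fatal by keyword.
    def reload_config(self, errors_fatal=False):
        pass

# At a call site:
# autoscaler.reload_config(errors_fatal=True)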

There are two ways to start an autoscaling cluster: manually by running
`ray start --head --autoscaling-config=/path/to/config.json` on a
instance that has permission to launch other instances, or you can also use
`ray bootstrap /path/to/config.json` from your laptop, which will configure
Contributor: Out of date?

ericl (Author): Fixed.

remote_key_path = "~/ray_bootstrap_key.pem".format(
    config["auth"]["ssh_user"])
cluster_config_path = "~/ray_bootstrap_config.yaml".format(
    config["auth"]["ssh_user"])
Contributor: Does this do anything? There's no {} in the above two strings.

ericl (Author): Fixed.
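
For the record, str.format() with no {} placeholders simply returns the string unchanged, so the two calls above had no effect; a small demonstration (the "ubuntu" argument is just a stand-in value, not taken from the config):

# .format() without placeholders is a no-op on the string.
remote_key_path = "~/ray_bootstrap_key.pem".format("ubuntu")
assert remote_key_path == "~/ray_bootstrap_key.pem"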

new_mounts = {}
for remote_path in config["file_mounts"].keys():
    new_mounts[remote_path] = remote_path
remote_config["file_mounts"] = new_mounts
Contributor: Nit: I don't think keys() is needed; also, the following is less verbose:

remote_config["file_mounts"] = {
    path: path for path in config["file_mounts"]}

ericl (Author): Removed keys.
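
A tiny self-contained check of the suggested comprehension; the sample paths are made up for illustration.

# Identity mapping of file mounts, as suggested above.
config = {"file_mounts": {"/remote/app": "/local/app",
                          "/remote/data": "/local/data"}}

remote_config = dict(config)
remote_config["file_mounts"] = {
    path: path for path in config["file_mounts"]}

assert remote_config["file_mounts"] == {
    "/remote/app": "/remote/app",
    "/remote/data": "/remote/data"}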

print(
    "StandardAutoscaler: Terminating unneeded node: "
    "{}".format(nodes[-1]))
self.provider.terminate_node(nodes[-1])
Contributor: Naive question: is this OK? Will this have weird effects on fault tolerance, or do you need to drain?

ericl (Author): I'm guessing fault tolerance will not be very happy with this, since we don't have task checkpoints. That's fine for now; GCS integration is not implemented.

Collaborator: @richardliaw, what do you mean by "drain"? Do you mean transferring objects away from that node to other nodes?

Contributor: Oh, I just meant letting tasks finish while not scheduling any more tasks on that node.

@ericl (Author) left a comment:

@robertnishihara SSH key creation should be more robust now.

# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
    InstanceType: m5.large
    ImageId: ami-212d465b
ericl (Author): Yeah, that could work. Is it possible to pip install a particular Travis build on an unmerged PR?

@@ -0,0 +1,326 @@
from __future__ import absolute_import
ericl (Author): Yes, I'm always forgetting...
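
For reference, a hedged reminder of the full Python 2/3 compatibility header that was commonly used across the codebase at the time; only the first line is quoted from this diff.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function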

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2810/

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2811/

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2812/

@robertnishihara (Collaborator) commented Dec 15, 2017:

Test failed on Travis with

$ python test/autoscaling_test.py
python: can't open file 'test/autoscaling_test.py': [Errno 2] No such file or directory

The command "python test/autoscaling_test.py" exited with 2.

It should be autoscaler_test.py.

@ericl (Author) commented Dec 15, 2017:

Oops! Fixed.

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2815/

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2816/

@AmplabJenkins: Merged build finished. Test PASSed.

@AmplabJenkins: Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/2818/


def _update(self):
    nodes = self.workers()
    target_num_workers = self.config["max_workers"]
@DmitriGekhtman (Contributor), Aug 27, 2021: I like target_num_workers; whatever happened to that?
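
For contrast, a hedged sketch of how a worker target could instead be clamped between the configured bounds; this is not the PR's logic, which at this point simply targets max_workers.

# Hedged sketch: clamp a requested worker count between min and max.
def clamp_target_workers(requested, config):
    return max(config["min_workers"],
               min(requested, config["max_workers"]))

# e.g. clamp_target_workers(10, {"min_workers": 1, "max_workers": 4}) == 4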
