[autoscaler] GCP node provider #2061

Merged on May 31, 2018 (73 commits)
Changes from 67 commits
Commits
a896450
Google Cloud Platform scaffolding
hartikainen May 15, 2018
a3b44df
Add minimal gcp config example
hartikainen May 15, 2018
281c7b6
Add googleapiclient discoveries, update gcp.config constants
hartikainen May 15, 2018
8cd3287
Rename and update gcp.config key pair name function
hartikainen May 15, 2018
8615abb
Implement gcp.config._configure_project
hartikainen May 15, 2018
f0539d3
Fix the create project get project flow
hartikainen May 15, 2018
ba8cdbf
Implement gcp.config._configure_iam_role
hartikainen May 15, 2018
012c5d8
Implement service account iam binding
hartikainen May 15, 2018
f88449b
Implement gcp.config._configure_key_pair
hartikainen May 15, 2018
08f53a4
Implement rsa key pair generation
hartikainen May 15, 2018
d17e244
Implement gcp.config._configure_subnet
hartikainen May 15, 2018
3e67a60
Save work-in-progress gcp.config._configure_firewall_rules.
hartikainen May 15, 2018
d05df31
Remove unnecessary firewall configuration
hartikainen May 15, 2018
b499ea0
Update example-minimal.yaml configuration
hartikainen May 15, 2018
5deb6ba
Add new wait_for_compute_operation, rename old wait_for_operation
hartikainen May 15, 2018
352a4ff
Temporarily rename autoscaler tags due to gcp incompatibility
hartikainen May 15, 2018
16a8605
Implement initial gcp.node_provider.nodes
hartikainen May 15, 2018
944468f
Implement initial gcp.node_provider.create_node
hartikainen May 15, 2018
9a5c8a3
Implement initial gcp.node_provider._node and node status functions
hartikainen May 15, 2018
6c78b40
Implement initial gcp.node_provider.terminate_node
hartikainen May 15, 2018
bd09fbc
Implement node tagging and ip getter methods for nodes
hartikainen May 15, 2018
52def7b
Temporarily rename tags due to gcp incompatibility
hartikainen May 15, 2018
ad301bb
Tiny tweaks for autoscaler.updater
hartikainen May 15, 2018
9a1d052
Remove unused config from gcp node_provider
hartikainen May 15, 2018
423f791
Add new example-full example to gcp, update load_gcp_example_config
hartikainen May 15, 2018
5d90308
Implement label filtering for gcp.node_provider.nodes
hartikainen May 15, 2018
f5634d3
Revert unnecessary change in ssh command
hartikainen May 15, 2018
7e1ea09
Revert "Temporarily rename tags due to gcp incompatibility"
hartikainen May 16, 2018
d71d4e4
Revert "Temporarily rename autoscaler tags due to gcp incompatibility"
hartikainen May 16, 2018
9cf4840
Refactor autoscaler tagging to support multiple tag specs
hartikainen May 16, 2018
7037922
Remove missing cryptography imports
hartikainen May 16, 2018
d9bea64
Update quote function import
hartikainen May 18, 2018
bd07ff1
Fix threading issue in gcp.config with the compute discovery object
hartikainen May 18, 2018
e6559fd
Add gcs support for log_sync
hartikainen May 19, 2018
de2980b
Fix the labels/tags naming discrepancy
hartikainen May 19, 2018
a221c0d
Add expanduser to file_mounts hashing
hartikainen May 19, 2018
9c3899c
Fix gcp.node_provider.internal_ip
hartikainen May 19, 2018
c3341b0
Add uuid to node name
hartikainen May 20, 2018
5fa034c
Remove 'set -i' from updater ssh command
hartikainen May 20, 2018
c152faa
Update ssh key creation in autoscaler.gcp.config
hartikainen May 20, 2018
5948f0f
Fix wait_for_compute_zone_operation's threading issue
hartikainen May 20, 2018
f40d7ef
Address pr feedback from @ericl
hartikainen May 20, 2018
44a3458
Expand local file mount paths in NodeUpdater
hartikainen May 20, 2018
646dc81
Add ssh_user name to key names
hartikainen May 21, 2018
23a0066
Update updater ssh to attempt 'set -i' and fall back if that fails
hartikainen May 21, 2018
e50dd00
Update gcp/example-full.yaml
hartikainen May 21, 2018
85c1e4b
Fix wait crm operation in gcp.config
hartikainen May 21, 2018
5517743
Update gcp/example-minimal.yaml to match aws/example-minimal.yaml
hartikainen May 21, 2018
35d946e
Fix gcp/example-full.yaml comment indentation
hartikainen May 21, 2018
dd8fc5f
Add gcp/example-full.yaml to setup files
hartikainen May 22, 2018
693de75
Update example-full.yaml command
hartikainen May 24, 2018
183453e
Revert "Refactor autoscaler tagging to support multiple tag specs"
hartikainen May 24, 2018
46250d3
Update tag spec to only use characters [0-9a-z_-]
hartikainen May 24, 2018
7a84bbd
Change the tag values to conform gcp spec
hartikainen May 24, 2018
3e7f91e
Add project_id in the ssh key name
hartikainen May 24, 2018
b81ab5a
Replace '_' with '-' in autoscaler tag names
hartikainen May 26, 2018
9f87340
Revert "Update updater ssh to attempt 'set -i' and fall back if that …
hartikainen May 26, 2018
a791241
Revert "Remove 'set -i' from updater ssh command"
hartikainen May 26, 2018
1489971
Add fallback to `set -i` in force_interactive command
hartikainen May 26, 2018
ae5a586
Update autoscaler tests to match current implementation
hartikainen May 26, 2018
8e37ab4
Update GCPNodeProvider.create_node to include hash in instance name
hartikainen May 26, 2018
79c0c19
Add support for creating multiple instance on one create_node call
hartikainen May 26, 2018
34a403a
Clean TODOs
hartikainen May 27, 2018
ccbe2aa
Update styles
hartikainen May 27, 2018
5d78ef5
Remove unnecessary comment. Fix indentation.
hartikainen May 27, 2018
d8f66b3
Merge branch 'master' into feature/gcp-node-provider
hartikainen May 28, 2018
6856117
Yapfify files that fail flake8 test
hartikainen May 29, 2018
41e90ed
Yapfify more files
hartikainen May 30, 2018
a59f81c
Update project_id handling in gcp node provider
hartikainen May 30, 2018
feeb3a8
Merge branch 'master' into hartikainen-feature/gcp-node-provider
richardliaw May 30, 2018
b6744e4
temporary yapf mod
richardliaw May 30, 2018
dd6b5ab
Revert "temporary yapf mod"
hartikainen May 31, 2018
940c1b1
Fix autoscaler/updater.py lint error, remove unused variable
hartikainen May 31, 2018
29 changes: 20 additions & 9 deletions python/ray/autoscaler/autoscaler.py
@@ -22,8 +22,9 @@
     get_default_config
 from ray.autoscaler.updater import NodeUpdaterProcess
 from ray.autoscaler.docker import dockerize_if_needed
-from ray.autoscaler.tags import TAG_RAY_LAUNCH_CONFIG, \
-    TAG_RAY_RUNTIME_CONFIG, TAG_RAY_NODE_STATUS, TAG_RAY_NODE_TYPE, TAG_NAME
+from ray.autoscaler.tags import (TAG_RAY_LAUNCH_CONFIG, TAG_RAY_RUNTIME_CONFIG,
+                                 TAG_RAY_NODE_STATUS, TAG_RAY_NODE_TYPE,
+                                 TAG_RAY_NODE_NAME)
 import ray.services as services

 REQUIRED, OPTIONAL = True, False
@@ -58,6 +59,7 @@
             "availability_zone": (str, OPTIONAL),  # e.g. us-east-1a
             "module": (str,
                        OPTIONAL),  # module, if using external node provider
+            "project_id": (str, OPTIONAL),  # gcp project id, if using gcp
         },
         REQUIRED),

@@ -244,6 +246,14 @@ def __init__(self,
         self.last_update_time = 0.0
         self.update_interval_s = update_interval_s

+        # Expand local file_mounts to allow ~ in the paths. This can't be done
+        # earlier when the config is written since we might be on different
+        # platform and the expansion would result in wrong path.
Review comment (Contributor): Thanks for adding this!

+        self.config["file_mounts"] = {
+            remote: os.path.expanduser(local)
+            for remote, local in self.config["file_mounts"].items()
+        }
+
         for local_path in self.config["file_mounts"].values():
             assert os.path.exists(local_path)

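The expansion step in this hunk can be tried in isolation. The following is a minimal sketch, assuming a plain dict that maps remote paths to local paths; the helper name `expand_file_mounts` and the example paths are hypothetical and do not appear in the PR.

```python
import os


def expand_file_mounts(file_mounts):
    """Return a copy of the mapping with ~ expanded in each local path.

    Hypothetical helper for illustration. Keys are remote paths and are
    left untouched; only the local (value) side is expanded.
    """
    return {
        remote: os.path.expanduser(local)
        for remote, local in file_mounts.items()
    }


# Illustrative paths only.
mounts = {"/remote/project": "~/project", "/remote/data": "/tmp/data"}
expanded = expand_file_mounts(mounts)
print(expanded)
```

Absolute local paths pass through unchanged, and the result depends on the home directory of the machine that runs the expansion, which is the reason the code comment gives for deferring it until the autoscaler starts.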
@@ -254,8 +264,8 @@ def update(self):
             self.reload_config(errors_fatal=False)
             self._update()
         except Exception as e:
-            print("StandardAutoscaler: Error during autoscaling: {}",
-                  traceback.format_exc())
+            print("StandardAutoscaler: Error during autoscaling: {}"
+                  "".format(traceback.format_exc()))
             self.num_failures += 1
             if self.num_failures > self.max_failures:
                 print("*** StandardAutoscaler: Too many errors, abort. ***")
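The effect of this change is easy to miss in the diff, so here is a small sketch comparing the two forms. The helper `captured` and the stand-in value `detail` are assumptions made for the demonstration; the real code passes `traceback.format_exc()`.

```python
import contextlib
import io


def captured(fn):
    """Run fn() and return whatever it printed to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        fn()
    return buf.getvalue()


detail = "boom"  # stand-in for traceback.format_exc()

# Old form: print() receives two arguments, so the "{}" placeholder is
# printed literally and the detail is appended after a space.
before = captured(lambda: print("Error during autoscaling: {}", detail))

# New form: the message is formatted first, then printed as one string.
after = captured(lambda: print("Error during autoscaling: {}".format(detail)))

print(repr(before))
print(repr(after))
```

The old call still showed the traceback, just with a stray `{}` in the message; the new call substitutes it into the placeholder.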
@@ -446,17 +456,18 @@ def launch_new_node(self, count):
         num_before = len(self.workers())
         self.provider.create_node(
             self.config["worker_nodes"], {
-                TAG_NAME: "ray-{}-worker".format(self.config["cluster_name"]),
-                TAG_RAY_NODE_TYPE: "Worker",
-                TAG_RAY_NODE_STATUS: "Uninitialized",
+                TAG_RAY_NODE_NAME: "ray-{}-worker".format(
+                    self.config["cluster_name"]),
+                TAG_RAY_NODE_TYPE: "worker",
+                TAG_RAY_NODE_STATUS: "uninitialized",
                 TAG_RAY_LAUNCH_CONFIG: self.launch_hash,
             }, count)
         if len(self.workers()) <= num_before:
             print("Warning: Num nodes failed to increase after node creation")

     def workers(self):
         return self.provider.nodes(tag_filters={
-            TAG_RAY_NODE_TYPE: "Worker",
+            TAG_RAY_NODE_TYPE: "worker",
         })

     def debug_string(self, nodes=None):
@@ -565,7 +576,7 @@ def add_content_hashes(path):
                     with open(os.path.join(dirpath, name), "rb") as f:
                         hasher.update(f.read())
         else:
-            with open(path, 'r') as f:
+            with open(os.path.expanduser(path), "r") as f:
                 hasher.update(f.read().encode("utf-8"))

     hasher.update(json.dumps(sorted(file_mounts.items())).encode("utf-8"))
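For readers who want to see the hashing idea end to end, the following is a simplified sketch and not the function from this PR. It assumes a dict of remote-to-local paths, uses SHA-1 only as an example digest, and reads every file in binary mode; the name `hash_file_mounts` is hypothetical.

```python
import hashlib
import json
import os


def hash_file_mounts(file_mounts):
    """Hypothetical helper: digest of the mount mapping plus file contents."""
    hasher = hashlib.sha1()
    # Hash the mapping itself so that renaming a mount changes the digest.
    hasher.update(json.dumps(sorted(file_mounts.items())).encode("utf-8"))
    for local in sorted(file_mounts.values()):
        path = os.path.expanduser(local)  # allow ~ in local paths
        if os.path.isdir(path):
            for dirpath, dirnames, filenames in os.walk(path):
                dirnames.sort()  # keep the traversal order deterministic
                for name in sorted(filenames):
                    hasher.update(name.encode("utf-8"))
                    with open(os.path.join(dirpath, name), "rb") as f:
                        hasher.update(f.read())
        else:
            with open(path, "rb") as f:
                hasher.update(f.read())
    return hasher.hexdigest()
```

Calling it twice on unchanged files yields the same digest, and editing any mounted file yields a different one, which is the property a change detector for file mounts needs.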
11 changes: 6 additions & 5 deletions python/ray/autoscaler/commands.py
@@ -19,7 +19,7 @@
     hash_launch_conf, fillout_defaults
 from ray.autoscaler.node_provider import get_node_provider, NODE_PROVIDERS
 from ray.autoscaler.tags import TAG_RAY_NODE_TYPE, TAG_RAY_LAUNCH_CONFIG, \
-    TAG_NAME
+    TAG_RAY_NODE_NAME
 from ray.autoscaler.updater import NodeUpdaterProcess


@@ -57,7 +57,7 @@ def teardown_cluster(config_file, yes):

     provider = get_node_provider(config["provider"], config["cluster_name"])
     head_node_tags = {
-        TAG_RAY_NODE_TYPE: "Head",
+        TAG_RAY_NODE_TYPE: "head",
Review comment (Contributor Author): Noting that the GCP label system doesn't support uppercase characters. I've temporarily changed all the tags to follow the GCP format, but will eventually refactor the tagging system to support both.

Reply (Contributor): Sounds good.

     }
     for node in provider.nodes(head_node_tags):
         print("Terminating head node {}".format(node))
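The lowercase tag values in this hunk follow from the label constraint described in the review comment. As a rough sketch of that constraint, assuming the character set named in the commit message "Update tag spec to only use characters [0-9a-z_-]" and ignoring any other GCP label rules, a hypothetical helper (`to_gcp_label_value` is not part of the PR) might look like this:

```python
import re

# Character set taken from the commit message; other label rules that GCP
# may enforce (length limits, for example) are deliberately not checked.
_ALLOWED_CHARS = re.compile(r"[0-9a-z_-]*")


def to_gcp_label_value(value):
    """Hypothetical helper: lowercase a tag value and check its characters."""
    lowered = value.lower()
    if _ALLOWED_CHARS.fullmatch(lowered) is None:
        raise ValueError(
            "unsupported characters in label value: {!r}".format(value))
    return lowered


print(to_gcp_label_value("Head"))
print(to_gcp_label_value("Uninitialized"))
```

The PR itself takes the simpler route of writing the values in lowercase at every call site rather than converting them at runtime.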
@@ -76,7 +76,7 @@ def get_or_create_head_node(config, no_restart, yes):

     provider = get_node_provider(config["provider"], config["cluster_name"])
     head_node_tags = {
-        TAG_RAY_NODE_TYPE: "Head",
+        TAG_RAY_NODE_TYPE: "head",
     }
     nodes = provider.nodes(head_node_tags)
     if len(nodes) > 0:
@@ -98,7 +98,8 @@ def get_or_create_head_node(config, no_restart, yes):
             provider.terminate_node(head_node)
         print("Launching new head node...")
         head_node_tags[TAG_RAY_LAUNCH_CONFIG] = launch_hash
-        head_node_tags[TAG_NAME] = "ray-{}-head".format(config["cluster_name"])
+        head_node_tags[TAG_RAY_NODE_NAME] = "ray-{}-head".format(
+            config["cluster_name"])
         provider.create_node(config["head_node"], head_node_tags, 1)

         nodes = provider.nodes(head_node_tags)
@@ -185,7 +186,7 @@ def get_head_node_ip(config_file):
     config = yaml.load(open(config_file).read())
     provider = get_node_provider(config["provider"], config["cluster_name"])
     head_node_tags = {
-        TAG_RAY_NODE_TYPE: "Head",
+        TAG_RAY_NODE_TYPE: "head",
     }
     nodes = provider.nodes(head_node_tags)
     if len(nodes) > 0: