Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[autoscaler] GCP node provider #2061

Merged
merged 73 commits into from
May 31, 2018

Conversation

hartikainen
Copy link
Contributor

@hartikainen hartikainen commented May 15, 2018

This is a work-in-progress PR for Google Cloud Platform node provider. The implementation doesn't fully work yet. Specifically, there's a problem with the pipes.quote breaking the ssh call on the GCP machines. I need to solve that issue before I can move forward with this one.

Happy to take feedback if someone has time to go through the code.

Current todos:

  • Figure out consistent way of handling instance tags (AWS and GCP use different tagging/labeling spec, specifically AWS tags break on GCP)
  • Figure out if there's a better way of creating and managing the GCP ssh keys
    • I don't know if there's a better way. We could use the service account ssh keys, but then the user would have to ssh using their gcloud cli or set the ssh keys up themselves.
  • Add project information in the ssh key names
  • Figure out why the ssh uptime command fails with pipes on GCP ubuntu image but not on AWS images
  • Clean up some of the googlecloudapi calls in config.py
  • Refactor tagging/labeling to work across multiple node_providers
  • Clean gcp label filtering
    • I think this can be left as is at least for now. Not the cleanest piece of code but that's partly a result of how gcp filters are passed in the api request.
  • Resolve TODOs
    • I left couple of TODOs in the code that can imo be left for future work.

What do these changes do?

Implements a gcp node_provider and corresponding setup and configuration.

Related issue number

#2049

@hartikainen hartikainen force-pushed the feature/gcp-node-provider branch 2 times, most recently from 285b1ab to 4144f3c Compare May 15, 2018 07:05
"-i", self.ssh_private_key,
"{}@{}".format(self.ssh_user, self.ssh_ip),
"bash --login -c {}".format(
pipes.quote(force_interactive + cmd))
Copy link
Contributor Author

@hartikainen hartikainen May 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason, I keep getting Exit Status 2 from ssh when using pipes.quote here. If I remove pipes.quote, then the ssh check (uptime command) passes, but some further calls fail. Has anyone seen this before?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I've definitely seen this before... for my case I could take out the quotes without a problem, but I guess that's not true here. I can try to put together a minimal example and post on stack overflow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, what do the result ssh commands end up looking like, and do they execute when then manually?

Copy link
Contributor Author

@hartikainen hartikainen May 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I forgot to mention that the command succeeds normally (with exit code 0) when I copy-paste it from the stdout and run it manually from command line (zsh). Here's the actual command:

ssh -o ConnectTimeout=5s -o StrictHostKeyChecking=no -i /Users/kristian/.ssh/ray-autoscaler_gcp_1_us-west1.pem kristian@35.203.188.126 bash --login -c 'set -i && source ~/.bashrc && uptime'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK unfortunately my memory fails me and I'm not able to reproduce this (I recall getting exit code 2 when playing with screen via this command).

What's the best way to test out this branch and reproduce this issue? (ie, setup gcp configs and cluster)

Copy link
Contributor

@richardliaw richardliaw May 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK the issue is probably that bash is 4.4 on GCE and 4.3 on AWS, and set -i doesn't actually evaluate correctly in 4.4.

Looks like we actually have an issue here ... #1444.

We should probably keep pipes.quote in check_call as that's probably the correct evaluation of the method, and I think it make sense to get rid of the set -i on L187 in updater.py, make a note, and find alternatives/workarounds when we run into issues.

cc @ericl if you have any better ideas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @richardliaw, that's a good find. If getting rid of the set -i solves the problem, I think your suggesting would be good enough at least for now. I'll verify that tomorrow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, forked this branch and the change works for me. Let me know if you run into
any other issues!

Copy link
Contributor Author

@hartikainen hartikainen May 19, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw @richardliaw, I noticed that adding ' around the command and using pipes.quote at the same time also solves this problem. I.e., if I do:

"bash --login -c '{}'".format(quote(force_interactive + cmd))

instead of

"bash --login -c {}".format(quote(force_interactive + cmd))

then the set -i does not cause any problems. Seems like the code elsewhere does this too:

local_to_remote_sync_cmd = ("aws s3 sync '{}' '{}'".format(

This might lead to some other issues however.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting - seems odd, as set -i will throw an error if ran independently in the shell... I guess it doesn't matter for now/we can deal with it in a separate PR, as it seems like you have a couple workarounds!

@@ -57,7 +57,7 @@ def teardown_cluster(config_file, yes):

provider = get_node_provider(config["provider"], config["cluster_name"])
head_node_tags = {
TAG_RAY_NODE_TYPE: "Head",
TAG_RAY_NODE_TYPE: "head",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that GCP label system doesn't support uppercase characters. I've temporarily changed all the tags to follow GCP format, but will eventually refactor the tagging system to support both.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5398/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5400/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5410/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5414/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5415/
Test PASSed.


from ray.ray_constants import BOTO_MAX_RETRIES

crm = discovery.build('cloudresourcemanager', 'v1')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to do something like this?

    from google.auth import compute_engine
    credentials = compute_engine.Credentials()
    crm = discovery.build('cloudresourcemanager', 'v1', credentials=credentials)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is the standard way to just use the environment variable?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NVM got it to work via gcloud auth application-default login

@richardliaw
Copy link
Contributor

richardliaw commented May 16, 2018

For others who haven't gone through setting up GCE authentication - an incomplete list of steps:

  1. https://cloud.google.com/sdk/downloads#versioned
  2. gcloud auth application-default login

List of random issues:

  • After creating a head node, I am unable to ssh due to the following error:
    *** subprocess.CalledProcessError: Command '['ssh', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-i', '/Users/rliaw/.ssh/ray-autoscaler_gcp_3_us-west1.pem', 'ubuntu@104.198.7.195', "bash --login -c 'set -i && source ~/.bashrc && uptime'"]' returned non-zero exit status 255. . --- addressed by using Google Account Login instead of ubuntu
    • It's also not clear if the pem is actually working - currently setting private key to be .ssh/google_compute_engine
  • It seems like GCP pems are being created every time I run create_or_update

@hartikainen
Copy link
Contributor Author

hartikainen commented May 17, 2018

@richardliaw, I believe the error in your case happens due to your login name (ubuntu) being incorrect. The default ubuntu image that I use uses your google account name as login name (for me it was just kristian), or alternatively no login name at all.

@richardliaw
Copy link
Contributor

Ah I also just figured that out - thanks!

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5467/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5677/
Test PASSed.

@richardliaw
Copy link
Contributor

hey the following lint commands are still failing:

flake8 --exclude=python/ray/core/src/common/flatbuffers_ep-prefix/,python/ray/core/generated/,src/common/format/,doc/source/conf.py,python/ray/cloudpickle/
.travis/yapf.sh

Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

read through everything except for config.py and it looks pretty good; will try it out right now.

@richardliaw
Copy link
Contributor

richardliaw commented May 29, 2018

Trying out the example yaml; it seems like every single project needs to be unique? ie, if a person wanted to use example-full.yaml, they would have to modify the project_id?

This is fine, and perhaps if so we should just leave it blank and also raise a better error (as a TODO or something)

Trace here:

  File "/Users/rliaw/Research/riselab/ray/python/ray/autoscaler/gcp/config.py", line 131, in _configure_project
    _create_project(project_id)
  File "/Users/rliaw/Research/riselab/ray/python/ray/autoscaler/gcp/config.py", line 333, in _create_project
    "name": project_id
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/googleapiclient/http.py", line 840, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 409 when requesting https://cloudresourcemanager.googleapis.com/v1/projects?alt=json returned "Requested entity already exists">

Overall, it works very smoothly! The issue with creating a new key-pair each is resolved. I think after linting we can merge, and then push out documentation + beta test it after the merge.

@richardliaw richardliaw changed the title WIP: [autoscaler] Feature/gcp node provider [autoscaler] Feature/gcp node provider May 29, 2018
@hartikainen
Copy link
Contributor Author

@richardliaw All the linting error should be fixed now. I also changed the handling of project_ids. See: a59f81c

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5702/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5701/
Test PASSed.

@hartikainen
Copy link
Contributor Author

Not sure what fails in the travis job right now. The job fails to

19.56s$ .travis/yapf.sh
Reformatted staged files. Please review and stage the changes.
Files updated:
python/ray/autoscaler/updater.py
The command ".travis/yapf.sh" exited with 1.

Running yapf locally doesn't produce any errors or diffs on the autoscaler files.

Anything obvious I'm missing here?

@robertnishihara
Copy link
Collaborator

robertnishihara commented May 30, 2018 via email

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5719/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5724/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5743/
Test FAILed.

@richardliaw
Copy link
Contributor

jenkins retest this please

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/5752/
Test PASSed.

Copy link
Contributor

@richardliaw richardliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me!

@richardliaw richardliaw changed the title [autoscaler] Feature/gcp node provider [autoscaler] GCP node provider May 31, 2018
@richardliaw
Copy link
Contributor

Test failures unrelated - will merge by tomorrow morning if @ericl has no comments

@richardliaw
Copy link
Contributor

Test failures look unrelated; if there are issues we can fix them in another PR. Thanks for contributing this @hartikainen!

@richardliaw richardliaw merged commit 74dc14d into ray-project:master May 31, 2018
alok added a commit to alok/ray that referenced this pull request Jun 3, 2018
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)
@hartikainen hartikainen deleted the feature/gcp-node-provider branch February 27, 2019 18:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants