[autoscaler] GCP node provider #2061
Conversation
Force-pushed 285b1ab to 4144f3c (Compare)
python/ray/autoscaler/updater.py
Outdated
    "-i", self.ssh_private_key,
    "{}@{}".format(self.ssh_user, self.ssh_ip),
    "bash --login -c {}".format(
        pipes.quote(force_interactive + cmd))
For some reason, I keep getting Exit Status 2 from ssh when using `pipes.quote` here. If I remove `pipes.quote`, then the ssh check (the `uptime` command) passes, but some further calls fail. Has anyone seen this before?
OK, I've definitely seen this before... In my case I could take out the quotes without a problem, but I guess that's not true here. I can try to put together a minimal example and post it on Stack Overflow.
Hmm, what do the resulting ssh commands end up looking like, and do they execute when run manually?
Oh yeah, I forgot to mention that the command succeeds normally (with exit code 0) when I copy-paste it from stdout and run it manually from the command line (zsh). Here's the actual command:
ssh -o ConnectTimeout=5s -o StrictHostKeyChecking=no -i /Users/kristian/.ssh/ray-autoscaler_gcp_1_us-west1.pem kristian@35.203.188.126 bash --login -c 'set -i && source ~/.bashrc && uptime'
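One possible explanation for the copy-paste discrepancy: when `subprocess` runs the ssh command as an argument list, the quoted string produced by `pipes.quote` reaches the remote side verbatim; when the same line is pasted into a local shell, that shell performs word splitting and quote removal first, so ssh sees different arguments. A minimal sketch of the difference, using `shlex.quote` (the Python 3 equivalent of `pipes.quote`) and a placeholder `host`:

```python
import shlex

# The remote command the autoscaler wants to run (from the ssh line above).
cmd = "set -i && source ~/.bashrc && uptime"

# What ssh receives when the command is built as an argument list for
# subprocess: each element is passed verbatim, quotes included.
argv_from_subprocess = ["ssh", "host",
                        "bash --login -c {}".format(shlex.quote(cmd))]

# What ssh receives when the same line is pasted into a local shell:
# the shell strips the single quotes before ssh ever sees the argument.
argv_from_shell = shlex.split(
    "ssh host bash --login -c {}".format(shlex.quote(cmd)))

print(argv_from_subprocess[-1])
# bash --login -c 'set -i && source ~/.bashrc && uptime'
print(argv_from_shell[-1])
# set -i && source ~/.bashrc && uptime
```

So a manual zsh run is not executing quite the same thing as the updater does, which would explain why it exits 0 locally but 2 through `check_call`.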
OK, unfortunately my memory fails me and I'm not able to reproduce this (I recall getting exit code 2 when playing with screen via this command).
What's the best way to test out this branch and reproduce the issue (i.e., set up GCP configs and a cluster)?
OK, the issue is probably that bash is 4.4 on GCE and 4.3 on AWS, and `set -i` doesn't actually evaluate correctly in 4.4. Looks like we actually have an issue here already... #1444.
We should probably keep `pipes.quote` in `check_call`, as that's probably the correct evaluation of the method. I think it makes sense to get rid of the `set -i` on L187 in `updater.py`, make a note, and find alternatives/workarounds when we run into issues.
cc @ericl if you have any better ideas.
Thanks, @richardliaw, that's a good find. If getting rid of the `set -i` solves the problem, I think your suggestion would be good enough, at least for now. I'll verify that tomorrow.
Yep, forked this branch and the change works for me. Let me know if you run into any other issues!
Btw @richardliaw, I noticed that adding `'` around the command and using `pipes.quote` at the same time also solves this problem. I.e., if I do
`"bash --login -c '{}'".format(quote(force_interactive + cmd))`
instead of
`"bash --login -c {}".format(quote(force_interactive + cmd))`
then the `set -i` does not cause any problems. Seems like the code elsewhere does this too:
ray/python/ray/tune/log_sync.py
Line 94 in f1da721
local_to_remote_sync_cmd = ("aws s3 sync '{}' '{}'".format(
This might lead to some other issues however.
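For reference, here is what the two format variants actually produce, sketched with `shlex.quote` (the Python 3 name for `pipes.quote`) and an example command. Note that in the second variant the literal quotes sit directly next to the ones added by `quote`, so when the remote shell parses the string the quoting pairs cancel out; that is plausibly why `set -i` stops misbehaving, rather than the double quoting being more correct:

```python
from shlex import quote

# Example values; the real ones come from updater.py.
force_interactive = "set -i && source ~/.bashrc && "
cmd = "uptime"

# Variant that triggered exit status 2 on the GCP image:
v1 = "bash --login -c {}".format(quote(force_interactive + cmd))

# Variant with extra literal quotes that worked:
v2 = "bash --login -c '{}'".format(quote(force_interactive + cmd))

print(v1)  # bash --login -c 'set -i && source ~/.bashrc && uptime'
print(v2)  # bash --login -c ''set -i && source ~/.bashrc && uptime''
```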
Interesting - seems odd, as `set -i` will throw an error if run independently in the shell... I guess it doesn't matter for now / we can deal with it in a separate PR, as it seems like you have a couple of workarounds!
@@ -57,7 +57,7 @@ def teardown_cluster(config_file, yes):
     provider = get_node_provider(config["provider"], config["cluster_name"])
     head_node_tags = {
-        TAG_RAY_NODE_TYPE: "Head",
+        TAG_RAY_NODE_TYPE: "head",
Noting that the GCP label system doesn't support uppercase characters. I've temporarily changed all the tags to follow the GCP format, but will eventually refactor the tagging system to support both.
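As I understand the GCP docs, label values are restricted to lowercase letters, digits, underscores and hyphens, with a 63-character limit, which is why `"Head"` is rejected while `"head"` is fine. An illustrative validation helper (not part of this PR; the regex reflects my reading of the constraint):

```python
import re

# GCP label values: only lowercase letters, digits, underscores and
# hyphens, at most 63 characters (empty values are allowed).
GCP_LABEL_VALUE_RE = re.compile(r"^[a-z0-9_-]{0,63}$")

def is_valid_gcp_label_value(value):
    """Return True if `value` satisfies the GCP label-value constraints."""
    return bool(GCP_LABEL_VALUE_RE.match(value))

print(is_valid_gcp_label_value("head"))  # True
print(is_valid_gcp_label_value("Head"))  # False: uppercase not allowed
```

A check like this (or an explicit lowercasing step) is one way a refactored tagging layer could support both the AWS-style and GCP-style tag values.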
sounds good
Test PASSed.
python/ray/autoscaler/gcp/config.py
Outdated
from ray.ray_constants import BOTO_MAX_RETRIES

crm = discovery.build('cloudresourcemanager', 'v1')
Does it make sense to do something like this?
from google.auth import compute_engine
credentials = compute_engine.Credentials()
crm = discovery.build('cloudresourcemanager', 'v1', credentials=credentials)
Or is the standard way to just use the environment variable?
NVM, got it to work via `gcloud auth application-default login`.
For others who haven't gone through setting up GCE authentication - an incomplete list of steps:
List of random issues:
@richardliaw, I believe the error in your case happens due to your login name (
Ah, I also just figured that out - thanks!
These are likely not needed at all. Saving them in case we happen to need them later.
Hey, the following lint commands are still failing:
Read through everything except for `config.py` and it looks pretty good; will try it out right now.
Trying out the example yaml; it seems like every single project needs to be unique? ie, if a person wanted to use
This is fine, and perhaps if so we should just leave it blank and also raise a better error (as a TODO or something).
Trace here:
Overall, it works very smoothly! The issue with creating a new key-pair each time is resolved. I think after linting we can merge, and then push out documentation + beta test it after the merge.
@richardliaw All the linting errors should be fixed now. I also changed the handling of
Test FAILed.
Not sure what fails in the travis job right now. The job fails at:
19.56s$ .travis/yapf.sh
Reformatted staged files. Please review and stage the changes.
Files updated:
python/ray/autoscaler/updater.py
The command ".travis/yapf.sh" exited with 1.
Running yapf locally doesn't produce any errors or diffs on the autoscaler files. Anything obvious I'm missing here?
Could it be a mismatch in the version of yapf?
Test FAILed.
jenkins retest this please
Test PASSed.
This looks good to me!
Test failures unrelated - will merge by tomorrow morning if @ericl has no comments.
Test failures look unrelated; if there are issues we can fix them in another PR. Thanks for contributing this @hartikainen!
* master:
  [autoscaler] GCP node provider (ray-project#2061)
  [xray] Evict tasks from the lineage cache (ray-project#2152)
  [ASV] Add ray.init and simple Ray benchmarks (ray-project#2166)
  Re-encrypt key for uploading to S3 from travis to use travis-ci.com. (ray-project#2169)
  [rllib] Fix A3C PyTorch implementation (ray-project#2036)
  [JavaWorker] Do not kill local-scheduler-forked workers in RunManager.cleanup (ray-project#2151)
  Update Travis CI badge from travis-ci.org to travis-ci.com. (ray-project#2155)
  Implement Python global state API for xray. (ray-project#2125)
  [xray] Improve flush algorithm for the lineage cache (ray-project#2130)
  Fix support for actor classmethods (ray-project#2146)
  Add empty df test (ray-project#1879)
  [JavaWorker] Enable java worker support (ray-project#2094)
  [DataFrame] Fixing the code formatting of the tests (ray-project#2123)
  Update resource documentation (remove outdated limitations). (ray-project#2022)
  bugfix: use array redis_primary_addr out of its scope (ray-project#2139)
  Fix infinite retry in Push function. (ray-project#2133)
  [JavaWorker] Changes to the directory under src for support java worker (ray-project#2093)
  Integrate credis with Ray & route task table entries into credis. (ray-project#1841)
This is a work-in-progress PR for a Google Cloud Platform node provider. The implementation doesn't fully work yet. Specifically, there's a problem with `pipes.quote` breaking the ssh call on the GCP machines. I need to solve that issue before I can move forward with this one. Happy to take feedback if someone has time to go through the code.
Current todos:
- gcloud cli or set the ssh keys up themselves.
- pipes on GCP ubuntu image but not on AWS images
What do these changes do?
Implements a gcp node_provider and corresponding setup and configuration.
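As a rough illustration of the kind of interface such a node provider has to satisfy, here is a minimal in-memory sketch. The method names follow the autoscaler's node-provider convention, and the tag key `"ray-node-type"` is a hypothetical stand-in for `TAG_RAY_NODE_TYPE`; the real implementation lives in the autoscaler package and calls the GCP API instead of a dict:

```python
import uuid

class InMemoryNodeProvider:
    """Illustrative NodeProvider-style sketch backed by a dict,
    not the actual Ray class or the GCP implementation."""

    def __init__(self, provider_config, cluster_name):
        self.provider_config = provider_config
        self.cluster_name = cluster_name
        self.nodes = {}  # node_id -> {"tags": dict, "running": bool}

    def create_node(self, node_config, tags, count):
        """Launch `count` nodes carrying the given tags."""
        for _ in range(count):
            node_id = uuid.uuid4().hex[:8]
            self.nodes[node_id] = {"tags": dict(tags), "running": True}

    def non_terminated_nodes(self, tag_filters):
        """Return ids of running nodes whose tags match all filters."""
        return [
            node_id
            for node_id, node in self.nodes.items()
            if node["running"]
            and all(node["tags"].get(k) == v for k, v in tag_filters.items())
        ]

    def node_tags(self, node_id):
        return self.nodes[node_id]["tags"]

    def set_node_tags(self, node_id, tags):
        self.nodes[node_id]["tags"].update(tags)

    def terminate_node(self, node_id):
        self.nodes[node_id]["running"] = False

provider = InMemoryNodeProvider({}, "demo-cluster")
provider.create_node({}, {"ray-node-type": "head"}, 1)
print(len(provider.non_terminated_nodes({"ray-node-type": "head"})))  # 1
```

The GCP-specific work in this PR is then essentially mapping these operations onto GCE instances and labels (hence the lowercase-tag change above).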
Related issue number
#2049