
[autoscaler] Kubernetes autoscaler backend #5492

Merged: 48 commits merged into ray-project:master on Oct 3, 2019

Conversation

@edoakes (Contributor) commented Aug 20, 2019

Why are these changes needed?

Currently, the only way to run a Ray cluster on Kubernetes is to manually deploy Kubernetes configs for the head and worker nodes.

What do these changes do?

Adds Kubernetes as a backend to the autoscaler so that ray up can be used to start a Ray cluster on top of a shared Kubernetes cluster.

The head node runs as a pod with permissions to create, watch, and update the worker pods. Permissions are automatically configured by the autoscaler, which can be run from any machine that has access to the Kubernetes cluster (i.e., can successfully run kubectl commands against it).
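
As a rough illustration of the workflow, the sketch below shows a heavily abbreviated `ray up` config for this backend. Only the fields visible in this PR's diff (the provider's use_internal_ips and namespace fields, and worker_nodes as a Kubernetes pod spec) are taken from the PR; the remaining keys and values, including the image name, are illustrative placeholders, and python/ray/autoscaler/kubernetes/example-full.yaml in the PR is the authoritative reference.

# Abbreviated, illustrative sketch of a Kubernetes cluster config for `ray up`.
cluster_name: example   # placeholder name
min_workers: 0
max_workers: 2

provider:
  type: kubernetes
  use_internal_ips: true
  # Namespace to use for all resources created.
  namespace: ray

# Kubernetes pod config for worker node pods (abbreviated).
worker_nodes:
  apiVersion: v1
  kind: Pod
  metadata:
    generateName: ray-worker-
  spec:
    restartPolicy: Never
    containers:
      - name: ray-node
        image: rayproject/ray:latest   # placeholder image with Ray installed
        command: ["/bin/bash", "-c", "--"]
        args: ["trap : TERM INT; sleep infinity & wait;"]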

Related issue number

#1437

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16424/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16425/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16438/

@richardliaw (Contributor) commented:

Hey @edoakes, in this PR, can you make sure to update the relevant documentation files (marking things as experimental as necessary)?

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16473/

@edoakes requested review from ericl and pcmoritz, August 23, 2019 01:51

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16474/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16475/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16477/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16476/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16735/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16736/

use_internal_ips: true

# Namespace to use for all resources created.
namespace:
Contributor:

Is there a reason we need to expose so much of the low-level Kubernetes API to the user? Could the places where only a name is given be collapsed (e.g., namespace: ray) and handled by the node provider instead?

@edoakes (Contributor, Author):

Good question. I originally tried hiding the Kubernetes details from the user, but I think that would be less clear: users might not understand exactly how the fields map to Kubernetes resources, we would have to write more code to support different configuration options, and users would have to learn yet another syntax instead of just using the Kubernetes one. I expect that most users will just use the example config, so this shouldn't be much of a problem.

@gehring (Contributor) commented Sep 4, 2019:

Given there isn't much more to a namespace than its name, I don't think that reasoning applies to the namespace specifically. It isn't uncommon to see configs with a namespace: some-name entry, and it matches the kubectl API, e.g.,

kubectl apply -f config.yaml --namespace some-name

That said, this is a mostly insignificant detail in the grand scheme of things, so if you prefer the way it is now, I don't see a problem leaving it as is.

@edoakes (Contributor, Author):

That's a good point, would definitely look cleaner that way. I will change it for the namespace field only.
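
For concreteness, the change agreed on here is roughly the following; the expanded "before" form is illustrative rather than a verbatim copy of the example config, while the collapsed form is the namespace: ray style discussed above.

# Before (illustrative): namespace spelled out as a full Kubernetes resource.
provider:
  type: kubernetes
  use_internal_ips: true
  namespace:
    apiVersion: v1
    kind: Namespace
    metadata:
      name: ray

# After: only the name is given; the node provider creates the Namespace resource.
provider:
  type: kubernetes
  use_internal_ips: true
  namespace: ray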

resource: requests.cpu

# Kubernetes pod config for worker node pods.
worker_nodes:
Contributor:

This might be outside the scope of this example, but it seems like a good use case for kustomize. It has several useful features that would likely streamline customization of the config and reduce the amount of repeated YAML. It is also very simple to use and is packaged with kubectl, so it doesn't add any external dependencies.

@edoakes (Contributor, Author):

Thanks for the pointer! Not completely sure this is a good place to use it, but possibly in the manual Kubernetes deployment docs. Will keep this in mind.
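
For reference, a minimal kustomize layout along the lines of this suggestion might look like the sketch below; the file names and the patch are hypothetical and not part of this PR. It would be applied with kubectl apply -k <dir>.

# kustomization.yaml (hypothetical sketch for the manual deployment docs)
namespace: ray
resources:
  - ray-head.yaml          # hypothetical base manifest for the head pod
  - ray-worker.yaml        # hypothetical base manifest for the worker pods
patchesStrategicMerge:
  - worker-resources.yaml  # hypothetical per-cluster override of CPU/memory requests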

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16755/

@robertnishihara (Collaborator) left a comment:

I've tried it out and it seems to work well. I also tried running a simple AWS workload to check that the other autoscaler code paths aren't broken.

I still haven't had a chance to read through it carefully.

# Get a remote screen on the head node.
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml
$ source activate tensorflow_p36
$ # Try running a Ray program with 'ray.init(address="localhost:6379")'.
Collaborator:

Suggested change:
- $ # Try running a Ray program with 'ray.init(address="localhost:6379")'.
+ $ # Try running a Ray program with 'ray.init(address="localhost:6379")'
+ $ # inside of a Python interpreter.

The same comment applies to the GCP, Kubernetes, and private cluster examples.

@@ -0,0 +1,25 @@
#!/bin/bash

# Helper script to use kubectl as a remote shell for rsync to sync files
Contributor:

nice, this is so hacky 💻

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16786/

@AmplabJenkins
Test PASSed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/16787/

@simon-mo (Contributor) left a comment:

Also, when I run the full example config:

(base) ubuntu@ip-172-31-17-39:~/ray$ ray up python/ray/autoscaler/kubernetes/example-full.yaml
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
/home/ubuntu/miniconda3/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.3) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/bin/ray", line 11, in <module>
    load_entry_point('ray', 'console_scripts', 'ray')()
  File "/home/ubuntu/ray/python/ray/scripts/scripts.py", line 787, in main
    return cli()
  File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/ubuntu/ray/python/ray/scripts/scripts.py", line 474, in create_or_update
    no_restart, restart_only, yes, cluster_name)
  File "/home/ubuntu/ray/python/ray/autoscaler/commands.py", line 45, in create_or_update_cluster
    config = _bootstrap_config(config)
  File "/home/ubuntu/ray/python/ray/autoscaler/commands.py", line 59, in _bootstrap_config
    validate_config(config)
  File "/home/ubuntu/ray/python/ray/autoscaler/autoscaler.py", line 800, in validate_config
    check_extraneous(config, schema)
  File "/home/ubuntu/ray/python/ray/autoscaler/autoscaler.py", line 791, in check_extraneous
    check_extraneous(config[k], v)
  File "/home/ubuntu/ray/python/ray/autoscaler/autoscaler.py", line 789, in check_extraneous
    type(config[k]).__name__, v.__name__))
ValueError: Config key `namespace` has wrong type str, expected dict

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17089/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17144/

@simon-mo (Contributor) commented Oct 2, 2019:

Nit: please install tmux and psutil in the example Docker containers so that address="auto" works and ray attach works a bit better.
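
One way to do this without rebuilding the image is to install the extras when the container starts. The fragment below is a hypothetical container spec for the example pods, not the actual example config; the image name is a placeholder, and it assumes the image has apt and pip available.

# Hypothetical container spec fragment for the example head/worker pods.
containers:
  - name: ray-node
    image: rayproject/ray:latest   # placeholder; any image with Ray installed
    command: ["/bin/bash", "-c", "--"]
    args:
      - >-
        apt-get update && apt-get install -y tmux;
        pip install psutil;
        trap : TERM INT; sleep infinity & wait;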

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17417/

@AmplabJenkins
Test FAILed. Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/17422/

@edoakes merged commit 972dddd into ray-project:master on Oct 3, 2019
@hartikainen (Contributor):

FYI @orybkin
