-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[autoscaler] Kubernetes autoscaler backend #5492
Conversation
Test FAILed. |
Test PASSed. |
Test PASSed. |
Hey @edoakes, in this PR, can you make sure to update the relevant files of the documentation? (Marking things as experimental as necessary). |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test FAILed. |
Test PASSed. |
Test PASSed. |
use_internal_ips: true | ||
|
||
# Namespace to use for all resources created. | ||
namespace: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why we need to expose the user to as much low level kubernetes API? Could the places where only a name is given be collapsed (e.g., namespace: ray
) and handled by the node provider instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question. I originally tried hiding the Kubernetes details from the user but I think it would be less clear as they might not understand exactly how the fields map to Kubernetes resources, we would have to write more code to support different configuration options, and also users would have to learn another new syntax instead of just using the Kubernetes one. I expect that most users will just use the example config, so this shouldn't be much of a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given there isn't much more to a namespace than its name, I don't think this is particularly true for namespace specifically. I don't think it is uncommon to see configs with a namespace: some-name
entry. It also matches the kubectl api, e.g.,
kubectl apply -f config.yaml --namespace some-name
That being said, this is a mostly insignificant detail in the grand scheme of things so if you prefer the way it is now, I don't see a problem leaving it as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, would definitely look cleaner that way. I will change it for the namespace field only.
resource: requests.cpu | ||
|
||
# Kubernetes pod config for worker node pods. | ||
worker_nodes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be outside the scope of this example but this seems like a good use case for kustomize. It has several useful features that would likely streamline the "customization" of the config and would reduce the amount of repeated yaml lines. It is also very simple to use, and it is packaged with kubectl
so doesn't add any external dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the pointer! Not completely sure this is a good place to use it, but possibly in the manual Kubernetes deployment docs. Will keep this in mind.
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried it out and it seems to work well. I also tried running a simple AWS workload to check that the other autoscaler code paths are't broken.
I still haven't had a chance to read through it carefully.
doc/source/autoscaling.rst
Outdated
# Get a remote screen on the head node. | ||
$ ray attach ray/python/ray/autoscaler/gcp/example-full.yaml | ||
$ source activate tensorflow_p36 | ||
$ # Try running a Ray program with 'ray.init(address="localhost:6379")'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$ # Try running a Ray program with 'ray.init(address="localhost:6379")'. | |
$ # Try running a Ray program with 'ray.init(address="localhost:6379")' | |
$ # inside of a Python interpreter. |
Same comment applies to GCP and Kubernetes and private cluster.
@@ -0,0 +1,25 @@ | |||
#!/bin/bash | |||
|
|||
# Helper script to use kubectl as a remote shell for rsync to sync files |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice, this is so hacky 💻
Test PASSed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, when I run the example full:
(base) ubuntu@ip-172-31-17-39:~/ray$ ray up python/ray/autoscaler/kubernetes/example-full.yaml
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
/home/ubuntu/miniconda3/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.3) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/bin/ray", line 11, in <module>
load_entry_point('ray', 'console_scripts', 'ray')()
File "/home/ubuntu/ray/python/ray/scripts/scripts.py", line 787, in main
return cli()
File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/miniconda3/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/home/ubuntu/ray/python/ray/scripts/scripts.py", line 474, in create_or_update
no_restart, restart_only, yes, cluster_name)
File "/home/ubuntu/ray/python/ray/autoscaler/commands.py", line 45, in create_or_update_cluster
config = _bootstrap_config(config)
File "/home/ubuntu/ray/python/ray/autoscaler/commands.py", line 59, in _bootstrap_config
validate_config(config)
File "/home/ubuntu/ray/python/ray/autoscaler/autoscaler.py", line 800, in validate_config
check_extraneous(config, schema)
File "/home/ubuntu/ray/python/ray/autoscaler/autoscaler.py", line 791, in check_extraneous
check_extraneous(config[k], v)
File "/home/ubuntu/ray/python/ray/autoscaler/autoscaler.py", line 789, in check_extraneous
type(config[k]).__name__, v.__name__))
ValueError: Config key `namespace` has wrong type str, expected dict
Test FAILed. |
Test FAILed. |
nit. please install |
Test FAILed. |
Test FAILed. |
FYI @orybkin |
Why are these changes needed?
Currently, the only way to run a Ray cluster on Kubernetes is to manually deploy Kubernetes configs containing the head and worker nodes.
What do these changes do?
Adds Kubernetes as a backend to the autoscaler so that
ray up
can be used to start a Ray cluster on top of a shared Kubernetes cluster.The head node runs as a pod with permissions to create, watch, and update the worker pods. Permissions are automatically configured by the autoscaler, which can be run from any machine that has access to the Kubernetes cluster (i.e., can successfully run
kubectl
commands against it).Related issue number
#1437
Linter
scripts/format.sh
to lint the changes in this PR.