[autoscaler] Add support for separate docker containers on head and worker nodes #4537
Conversation
Can one of the admins verify this patch?
Test FAILed.
Test FAILed.
At a glance this looks good; thanks for the contribution! @hartikainen can you try this out for GCP?
Test FAILed.
@richardliaw I've resolved the discussion, could you take another look at the PR? 😄
Just tried running this - I think it breaks some part of the autoscaler. I tried running:
Note that the pip call is not executed within the docker instance (and should be). Can you take a look?
Could you post the entire startup log?
@richardliaw Try it now, it should be fixed. Also, I think the wheel in the pip install step should be updated to one that includes these changes.
Test FAILed.
@stefanpantic I got the same problem while initializing a cluster on GCP. How can I resolve it?
Awesome - this works on AWS. I used the following diff applied to example-gpu-docker.yaml:
diff --git a/python/ray/autoscaler/aws/example-gpu-docker.yaml b/python/ray/autoscaler/aws/example-gpu-docker.yaml
index 962685390..af636022f 100644
--- a/python/ray/autoscaler/aws/example-gpu-docker.yaml
+++ b/python/ray/autoscaler/aws/example-gpu-docker.yaml
@@ -3,7 +3,7 @@ cluster_name: gpu-docker
# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
-min_workers: 0
+min_workers: 1
# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
@@ -18,18 +18,18 @@ initial_workers: 0
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
- image: "tensorflow/tensorflow:1.12.0-gpu-py3"
+ # image: "tensorflow/tensorflow:1.12.0-py3"
container_name: "ray-nvidia-docker-test" # e.g. ray_docker
- run_options:
- - --runtime=nvidia
+ #run_options:
+ # - --runtime=nvidia
# # Example of running a GPU head with CPU workers
- # head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
- # head_run_options:
- # - --runtime=nvidia
+ head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
+ head_run_options:
+ - --runtime=nvidia
- # worker_image: "tensorflow/tensorflow:1.13.1-py3"
- # worker_run_options: []
+ worker_image: "tensorflow/tensorflow:1.13.1-py3"
+ worker_run_options: []
# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
@@ -63,7 +63,7 @@ auth:
# For more documentation on available fields, see:
# http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
head_node:
- InstanceType: p2.xlarge
+ InstanceType: g3.4xlarge
ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0
# You can provision additional disk space with a conf as follows
@@ -96,12 +96,14 @@ worker_nodes:
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
+ "/home/ubuntu/ray_code/": "/Users/rliaw/Research/riselab/ray/python/ray/"
}
# List of shell commands to run to set up nodes.
setup_commands:
# - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
- pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
+ - /home/ubuntu/ray_code/rllib/setup-rllib-dev.py --yes
# - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl
# Custom commands that will be run on the head node after common setup.
@hartikainen can you try this out and then merge if this works?
I tested this with the following config on GCP:
diff --git a/python/ray/autoscaler/gcp/example-gpu-docker.yaml b/python/ray/autoscaler/gcp/example-gpu-docker.yaml
index fa1face51f81..ae777845506a 100644
--- a/python/ray/autoscaler/gcp/example-gpu-docker.yaml
+++ b/python/ray/autoscaler/gcp/example-gpu-docker.yaml
@@ -3,7 +3,7 @@ cluster_name: gpu-docker
# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
-min_workers: 0
+min_workers: 1
# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
@@ -18,11 +18,17 @@ initial_workers: 0
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
- image: "tensorflow/tensorflow:1.12.0-gpu-py3"
+ # image: "tensorflow/tensorflow:1.12.0-gpu-py3"
container_name: "ray-nvidia-docker-test" # e.g. ray_docker
- run_options:
- - --runtime=nvidia
+ # run_options:
+ # - --runtime=nvidia
+ head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
+ head_run_options:
+ - --runtime=nvidia
+
+ worker_image: "tensorflow/tensorflow:1.13.1-py3"
+ worker_run_options: []
# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
@@ -39,7 +45,7 @@ provider:
type: gcp
region: us-west1
availability_zone: us-west1-b
- project_id: <project_id> # Globally unique project id
+ project_id: project_id # Globally unique project id
# How Ray will authenticate with newly launched nodes.
auth:
@@ -65,7 +71,7 @@ head_node:
# See https://cloud.google.com/compute/docs/images for more images
sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
guestAccelerators:
- - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
+ - acceleratorType: projects/project_id/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
acceleratorCount: 1
metadata:
items:
@@ -87,13 +93,6 @@ worker_nodes:
diskSizeGb: 50
# See https://cloud.google.com/compute/docs/images for more images
sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
- guestAccelerators:
- - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
- acceleratorCount: 1
- metadata:
- items:
- - key: install-nvidia-driver
- value: "True"
# Run workers on preemtible instance by default.
# Comment this out to use on-demand.
scheduling:
@@ -108,6 +107,7 @@ worker_nodes:
file_mounts: {
# "/path1/on/remote/machine": "/path1/on/local/machine",
# "/path2/on/remote/machine": "/path2/on/local/machine",
+ "/home/ubuntu/ray_code/": "~/github/hartikainen/ray/python/ray/"
}
initialization_commands:
@@ -129,6 +129,7 @@ setup_commands:
# Install ray
# - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
- pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
+ - /home/ubuntu/ray_code/rllib/setup-rllib-dev.py --yes
# - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl
# Custom commands that will be run on the head node after common setup.
Works fine. I'll merge this.
@hartikainen when I try to run it again, I still get the problem
then
Do you think of any possible error I could have made?
Can you post the full stack trace?
You can safely ignore the bash warning; it's irrelevant.
Here is the full trace @richard4912, thanks.
Maybe I have a problem with docker. After the above, I ran
Yes, they always come in pairs; you can safely ignore them. You can read about that here. I think there might be a problem with your docker daemon. Could you post the output of
This is the result when executed on the server
The user that you are running as may not have permissions to talk to /var/run/docker.sock on that system.
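If that is the cause, one common workaround is to give the remote SSH user access to the Docker daemon before the autoscaler issues any docker commands. Below is a minimal sketch for the cluster yaml, assuming the remote user is ubuntu (adjust for your image); the command and its placement are a suggestion, not something the autoscaler configures for you:

initialization_commands:
    # Hypothetical workaround: add the SSH user to the docker group so it can
    # talk to /var/run/docker.sock without sudo.
    - sudo usermod -aG docker ubuntu
    # Group changes only apply to new sessions, so SSH connections the
    # autoscaler opens afterwards will pick up the new membership.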
I changed it in docker.py and made it to
Could you send me the cluster.yaml you are trying to create?
Here is the cluster.yaml I am trying to create:
Do I have to grant Ray anything related to docker in IAM?
This PR introduced a Python lint failure on master.
Sorry, that's my bad. Should be fixed by #4584.
@hartikainen have you encountered the docker problem I had above?
@toanngosy unfortunately I have not seen the error before. I just tested your configuration and everything runs smoothly. Here's the command I ran:
I thought I had messed up the SSH setup and GCP permissions, so I deleted everything and started fresh. It is working smoothly now. Thanks @hartikainen
What do these changes do?
head_image - image for the head node; defaults to image if not specified
worker_image - same as head_image, but for worker nodes
head_run_options - appended to run_options on the head node
worker_run_options - same as head_run_options, but for worker nodes
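For reference, a minimal sketch of a docker section that uses all four new fields, modeled on the example-gpu-docker.yaml diffs earlier in this thread (the image tags are simply the ones used there):

docker:
    container_name: "ray_docker"
    # image would be the shared fallback if the per-node-type fields below were omitted.
    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
    head_run_options:
        - --runtime=nvidia   # appended to run_options on the head node
    worker_image: "tensorflow/tensorflow:1.13.1-py3"
    worker_run_options: []   # appended to run_options on worker nodes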
Related issue number
Linter
I've run scripts/format.sh to lint the changes in this PR.