
[autoscaler] Add support for separate docker containers on head and worker nodes #4537

Merged
merged 19 commits into ray-project:master on Apr 7, 2019

Conversation

stefanpantic
Contributor

What do these changes do?

  • Node-specific Docker parameters can now be specified when running a cluster with Docker containers.
  • Added fields (see the sketch after this list):
    • head_image - image for the head node; defaults to image if not specified
    • worker_image - image for worker nodes; defaults to image if not specified
    • head_run_options - appended to run_options on the head node
    • worker_run_options - appended to run_options on worker nodes
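
For illustration, a minimal docker section combining these fields might look like the sketch below. The image names and container name are placeholders borrowed from the example configs discussed later in this thread, not defaults added by this PR.

docker:
    image: "tensorflow/tensorflow:1.13.1-py3"             # fallback image used when a node-specific image is not given
    container_name: "ray_docker"
    run_options: []                                        # docker run flags shared by all nodes

    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"     # overrides image on the head node
    head_run_options:
        - --runtime=nvidia                                 # appended to run_options on the head node

    worker_image: "tensorflow/tensorflow:1.13.1-py3"       # overrides image on worker nodes
    worker_run_options: []                                 # appended to run_options on worker nodes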

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13438/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13442/
Test FAILed.

@richardliaw richardliaw changed the title [ray]Add support for separate docker containers on head and worker nodes in ray auto-scaler [autoscaler] Add support for separate docker containers on head and worker nodes Apr 2, 2019
@richardliaw
Contributor

At a glance this looks good; thanks for the contribution! @hartikainen can you try this out for GCP?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13464/
Test FAILed.

@stefanpantic
Contributor Author

@richardliaw I've resolved the discussion; could you take another look at the PR? 😄

@richardliaw
Contributor

Just tried running this - I think this breaks some part of the autoscaler. I tried running ray up example-gpu-docker.yaml on both master and this branch, and this branch fails with:

2019-04-04 01:07:44,704	ERROR updater.py:140 -- NodeUpdater: i-0f16fdcf21b889cbb: Error updating (Exit Status 1) ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=5m ubuntu@34.217.214.200 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 143, in run
    raise e
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 132, in run
    self.do_update()
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 228, in do_update
    self.ssh_cmd(cmd)
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 291, in ssh_cmd
    stderr=redirect or sys.stderr)
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@34.217.214.200', "bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl'"]' returned non-zero exit status 1.

2019-04-04 01:07:44,800	ERROR commands.py:260 -- get_or_create_head_node: Updating 34.217.214.200 failed

Note that the pip call is not executed within the Docker container (but it should be). Can you take a look?

@stefanpantic
Contributor Author

Could you post the entire startup log?

@stefanpantic
Contributor Author

@richardliaw Try it now; it should be fixed. Also, I think the wheel in the pip install step should be updated to one that includes these changes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13618/
Test FAILed.

@toanngosy
Contributor

toanngosy commented Apr 7, 2019

@stefanpantic I got the same problem while initializing a cluster on GCP. How do I resolve it?

2019-04-07 19:32:28,366	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:33,583	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:38,781	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:43,982	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:46,609	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Got SSH [LogTimer=134342ms]
2019-04-07 19:32:47,609	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554658367002-585f41bb655f3-6e875e6a-daafae19 to finish...
2019-04-07 19:32:54,568	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554658367002-585f41bb655f3-6e875e6a-daafae19 finished.
2019-04-07 19:32:54,569	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Syncing /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem...
2019-04-07 19:32:54,569	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-07 19:32:56,312	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-78448c6a: Synced /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem [LogTimer=1743ms]
2019-04-07 19:32:56,313	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Syncing /tmp/ray-bootstrap-0xyjtzqz to ~/ray_bootstrap_config.yaml...
2019-04-07 19:32:56,313	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-07 19:32:58,087	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-78448c6a: Synced /tmp/ray-bootstrap-0xyjtzqz to ~/ray_bootstrap_config.yaml [LogTimer=1774ms]
2019-04-07 19:33:02,973	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554658378463-585f41c653975-eac137dd-d211063a to finish...
2019-04-07 19:33:03,660	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554658378463-585f41c653975-eac137dd-d211063a finished.
2019-04-07 19:33:03,660	INFO updater.py:268 -- NodeUpdater: Running timeout 300 bash -c "
   command -v nvidia-smi && nvidia-smi
   until [ \$? -eq 0 ]; do
       command -v nvidia-smi && nvidia-smi
   done" on 35.247.101.31...
/usr/bin/nvidia-smi
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Sun Apr  7 17:33:04 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    87W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2019-04-07 19:33:06,037	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Initialization commands completed [LogTimer=2376ms]
2019-04-07 19:33:06,038	INFO updater.py:268 -- NodeUpdater: Running docker inspect -f '{{.State.Running}}' ray-nvidia-docker || docker run --rm --name ray-nvidia-docker -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash on 35.247.101.31...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
2019-04-07 19:33:06,536	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Setup commands completed [LogTimer=498ms]
2019-04-07 19:33:06,537	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Applied config 8faf70d6bd22a9d01be381c24e34f89affc0457f [LogTimer=157004ms]
2019-04-07 19:33:06,537	ERROR updater.py:140 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Error updating (Exit Status 125) ssh -i /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=5m ubuntu@35.247.101.31 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' ray-nvidia-docker || docker run --rm --name ray-nvidia-docker -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash'
2019-04-07 19:33:07,529	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554658386835-585f41ce4f8ce-e7c4771b-c6d5ef01 to finish...
2019-04-07 19:33:13,615	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554658386835-585f41ce4f8ce-e7c4771b-c6d5ef01 finished.
Exception in thread Thread-1:
Traceback (most recent call last):
 File "/home/toanngo/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
   self.run()
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 143, in run
   raise e
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 132, in run
   self.do_update()
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 228, in do_update
   self.ssh_cmd(cmd)
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 291, in ssh_cmd
   stderr=redirect or sys.stderr)
 File "/home/toanngo/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
   raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@35.247.101.31', 'bash --login -c -i \'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f \'"\'"\'{{.State.Running}}\'"\'"\' ray-nvidia-docker || docker run --rm --name ray-nvidia-docker -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash\'']' returned non-zero exit status 125.

2019-04-07 19:33:13,899	ERROR commands.py:258 -- get_or_create_head_node: Updating 35.247.101.31 failed

@richardliaw
Contributor

Awesome - this works on AWS. I used the following diff, applied to example-gpu-docker.yaml:

diff --git a/python/ray/autoscaler/aws/example-gpu-docker.yaml b/python/ray/autoscaler/aws/example-gpu-docker.yaml
index 962685390..af636022f 100644
--- a/python/ray/autoscaler/aws/example-gpu-docker.yaml
+++ b/python/ray/autoscaler/aws/example-gpu-docker.yaml
@@ -3,7 +3,7 @@ cluster_name: gpu-docker

 # The minimum number of workers nodes to launch in addition to the head
 # node. This number should be >= 0.
-min_workers: 0
+min_workers: 1

 # The maximum number of workers nodes to launch in addition to the head
 # node. This takes precedence over min_workers.
@@ -18,18 +18,18 @@ initial_workers: 0
 # and opens all the necessary ports to support the Ray cluster.
 # Empty string means disabled.
 docker:
-    image: "tensorflow/tensorflow:1.12.0-gpu-py3"
+    # image: "tensorflow/tensorflow:1.12.0-py3"
     container_name: "ray-nvidia-docker-test" # e.g. ray_docker
-    run_options:
-      - --runtime=nvidia
+    #run_options:
+    #  - --runtime=nvidia

     # # Example of running a GPU head with CPU workers
-    # head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
-    # head_run_options:
-    #     - --runtime=nvidia
+    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
+    head_run_options:
+        - --runtime=nvidia

-    # worker_image: "tensorflow/tensorflow:1.13.1-py3"
-    # worker_run_options: []
+    worker_image: "tensorflow/tensorflow:1.13.1-py3"
+    worker_run_options: []
 # The autoscaler will scale up the cluster to this target fraction of resource
 # usage. For example, if a cluster of 10 nodes is 100% busy and
@@ -63,7 +63,7 @@ auth:
 # For more documentation on available fields, see:
 # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
 head_node:
-    InstanceType: p2.xlarge
+    InstanceType: g3.4xlarge
     ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0

     # You can provision additional disk space with a conf as follows
@@ -96,12 +96,14 @@ worker_nodes:
 file_mounts: {
 #    "/path1/on/remote/machine": "/path1/on/local/machine",
 #    "/path2/on/remote/machine": "/path2/on/local/machine",
+     "/home/ubuntu/ray_code/": "/Users/rliaw/Research/riselab/ray/python/ray/"
 }

 # List of shell commands to run to set up nodes.
 setup_commands:
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
     - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
+    - /home/ubuntu/ray_code/rllib/setup-rllib-dev.py --yes
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl

 # Custom commands that will be run on the head node after common setup.

@hartikainen can you try this out and then merge if this works?

@hartikainen
Contributor

hartikainen commented Apr 7, 2019

I tested this with the following config on GCP:

diff --git a/python/ray/autoscaler/gcp/example-gpu-docker.yaml b/python/ray/autoscaler/gcp/example-gpu-docker.yaml
index fa1face51f81..ae777845506a 100644
--- a/python/ray/autoscaler/gcp/example-gpu-docker.yaml
+++ b/python/ray/autoscaler/gcp/example-gpu-docker.yaml
@@ -3,7 +3,7 @@ cluster_name: gpu-docker
 
 # The minimum number of workers nodes to launch in addition to the head
 # node. This number should be >= 0.
-min_workers: 0
+min_workers: 1
 
 # The maximum number of workers nodes to launch in addition to the head
 # node. This takes precedence over min_workers.
@@ -18,11 +18,17 @@ initial_workers: 0
 # and opens all the necessary ports to support the Ray cluster.
 # Empty string means disabled.
 docker:
-    image: "tensorflow/tensorflow:1.12.0-gpu-py3"
+    # image: "tensorflow/tensorflow:1.12.0-gpu-py3"
     container_name: "ray-nvidia-docker-test" # e.g. ray_docker
-    run_options:
-      - --runtime=nvidia
+    # run_options:
+    #   - --runtime=nvidia
 
+    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
+    head_run_options:
+        - --runtime=nvidia
+
+    worker_image: "tensorflow/tensorflow:1.13.1-py3"
+    worker_run_options: []
 
 # The autoscaler will scale up the cluster to this target fraction of resource
 # usage. For example, if a cluster of 10 nodes is 100% busy and
@@ -39,7 +45,7 @@ provider:
     type: gcp
     region: us-west1
     availability_zone: us-west1-b
-    project_id: <project_id> # Globally unique project id
+    project_id: project_id # Globally unique project id
 
 # How Ray will authenticate with newly launched nodes.
 auth:
@@ -65,7 +71,7 @@ head_node:
           # See https://cloud.google.com/compute/docs/images for more images
           sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
     guestAccelerators:
-      - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
+      - acceleratorType: projects/project_id/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
         acceleratorCount: 1
     metadata:
       items:
@@ -87,13 +93,6 @@ worker_nodes:
           diskSizeGb: 50
           # See https://cloud.google.com/compute/docs/images for more images
           sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
-    guestAccelerators:
-      - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
-        acceleratorCount: 1
-    metadata:
-      items:
-        - key: install-nvidia-driver
-          value: "True"
     # Run workers on preemtible instance by default.
     # Comment this out to use on-demand.
     scheduling:
@@ -108,6 +107,7 @@ worker_nodes:
 file_mounts: {
 #    "/path1/on/remote/machine": "/path1/on/local/machine",
 #    "/path2/on/remote/machine": "/path2/on/local/machine",
+    "/home/ubuntu/ray_code/": "~/github/hartikainen/ray/python/ray/"
 }
 
 initialization_commands:
@@ -129,6 +129,7 @@ setup_commands:
     # Install ray
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
     - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
+    - /home/ubuntu/ray_code/rllib/setup-rllib-dev.py --yes
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl
 
 # Custom commands that will be run on the head node after common setup.

Works fine. I'll merge this.

@hartikainen hartikainen merged commit 9154869 into ray-project:master Apr 7, 2019
@toanngosy
Contributor

@hartikainen when I try to run it again, I still get the problem

bash: no job control in this shell

then

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.

Can you think of any error I might have made?

@richardliaw
Contributor

Can you post the full stack trace?

@stefanpantic
Contributor Author

You can safely ignore the bash warning; it's irrelevant.

@toanngosy
Contributor

Here is the full trace, @richard4912. Thanks.

ray up cluster.yaml -y                                                   
2019-04-08 10:36:17,038	INFO commands.py:189 -- get_or_create_head_node: Launching new head node...
2019-04-08 10:36:18,632	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712577098-58600bae2c474-c0ff746f-881157eb to finish...
2019-04-08 10:36:52,368	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712577098-58600bae2c474-c0ff746f-881157eb finished.
2019-04-08 10:36:52,633	INFO commands.py:202 -- get_or_create_head_node: Updating files on head node...
2019-04-08 10:36:52,637	INFO updater.py:128 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Updating to 861d207745c2b9e999e1d6b20a2909416607a519
2019-04-08 10:36:53,739	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712613043-58600bd073dad-8d7d3da2-a2621f59 to finish...
2019-04-08 10:36:59,441	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712613043-58600bd073dad-8d7d3da2-a2621f59 finished.
2019-04-08 10:36:59,441	INFO updater.py:90 -- NodeUpdater: Waiting for IP of ray-gpu-docker-head-01ccc301...
2019-04-08 10:36:59,441	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Got IP [LogTimer=0ms]
2019-04-08 10:36:59,454	INFO updater.py:155 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Waiting for SSH...
2019-04-08 10:36:59,455	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:04,744	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:09,959	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:15,200	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:20,467	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:25,672	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:30,872	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:36,096	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:41,309	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:46,556	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:51,780	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:57,002	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:02,228	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:07,452	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:12,671	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:17,876	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:23,094	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:28,313	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:33,525	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:38,751	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:43,972	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:49,196	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:54,416	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:59,620	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:39:02,462	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Got SSH [LogTimer=123007ms]
2019-04-08 10:39:03,544	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712742801-58600c4c32fd6-4f8acd1a-c1fef099 to finish...
2019-04-08 10:39:09,690	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712742801-58600c4c32fd6-4f8acd1a-c1fef099 finished.
2019-04-08 10:39:09,691	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Syncing /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem...
2019-04-08 10:39:09,692	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-08 10:39:11,492	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-01ccc301: Synced /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem [LogTimer=1800ms]
2019-04-08 10:39:11,493	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Syncing /tmp/ray-bootstrap-fmcaxfby to ~/ray_bootstrap_config.yaml...
2019-04-08 10:39:11,494	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-08 10:39:13,255	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-01ccc301: Synced /tmp/ray-bootstrap-fmcaxfby to ~/ray_bootstrap_config.yaml [LogTimer=1761ms]
2019-04-08 10:39:14,226	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712753584-58600c567b938-c2d59239-7dac8e71 to finish...
2019-04-08 10:39:20,133	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712753584-58600c567b938-c2d59239-7dac8e71 finished.
2019-04-08 10:39:20,133	INFO updater.py:268 -- NodeUpdater: Running timeout 300 bash -c "
    command -v nvidia-smi && nvidia-smi
    until [ \$? -eq 0 ]; do
        command -v nvidia-smi && nvidia-smi
    done" on 35.247.101.31...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/usr/bin/nvidia-smi
Mon Apr  8 08:39:21 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P0    86W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2019-04-08 10:39:22,592	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Initialization commands completed [LogTimer=2458ms]
2019-04-08 10:39:22,592	INFO updater.py:268 -- NodeUpdater: Running docker inspect -f '{{.State.Running}}' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.13.1-gpu-py3 bash on 35.247.101.31...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
2019-04-08 10:39:23,085	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Setup commands completed [LogTimer=493ms]
2019-04-08 10:39:23,086	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Applied config 861d207745c2b9e999e1d6b20a2909416607a519 [LogTimer=150449ms]
2019-04-08 10:39:23,087	ERROR updater.py:140 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Error updating (Exit Status 125) ssh -i /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=5m ubuntu@35.247.101.31 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.13.1-gpu-py3 bash'
2019-04-08 10:39:24,174	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712763442-58600c5fe2559-39e99e1d-cda7b998 to finish...
2019-04-08 10:39:30,272	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712763442-58600c5fe2559-39e99e1d-cda7b998 finished.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/toanngo/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 143, in run
    raise e
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 132, in run
    self.do_update()
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 228, in do_update
    self.ssh_cmd(cmd)
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 291, in ssh_cmd
    stderr=redirect or sys.stderr)
  File "/home/toanngo/anaconda3/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@35.247.101.31', 'bash --login -c -i \'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f \'"\'"\'{{.State.Running}}\'"\'"\' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.13.1-gpu-py3 bash\'']' returned non-zero exit status 125.

2019-04-08 10:39:30,522	ERROR commands.py:258 -- get_or_create_head_node: Updating 35.247.101.31 failed

@toanngosy
Contributor

Sorry, I missed a line:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

Maybe I have a problem with Docker. After the above, I ran ray up cluster.yaml again to update the cluster and these bash warnings showed up multiple times.

@stefanpantic
Contributor Author

Yes, they always come in pairs; you can safely ignore them. You can read about that here. I think there might be a problem with your Docker daemon. Could you post the output of

ps -ef | grep docker

@toanngosy
Contributor

ubuntu@ray-gpu-docker-head-a9e49086:~$ ps -ef | grep docker
root     31839     1  0 08:54 ?        00:00:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ubuntu   32011 32007  0 08:56 pts/1    00:00:00 grep docker

This is the result, executed on the server.

@stefanpantic
Contributor Author

The user that you are running as may not have permissions to talk to /var/run/docker.sock on that system.
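
If that is the cause, one possible workaround (just a sketch, not something this PR adds) is to prepare the Docker socket permissions from the cluster YAML's initialization_commands, which the autoscaler runs on the host before any docker command; note that the group change normally only takes effect for new login sessions:

initialization_commands:
    # Hypothetical workaround: allow the non-root SSH user (ubuntu, per the auth section) to talk to the Docker daemon.
    - sudo usermod -aG docker ubuntu
    # Make sure the daemon is actually up before the autoscaler issues docker inspect / docker run.
    - sudo systemctl restart docker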

@toanngosy
Contributor

I changed docker.py to use sudo docker ... but it still cannot connect to the daemon. What should I do?

@stefanpantic
Contributor Author

Could you send me the cluster.yaml you are trying to create?

@toanngosy
Contributor

toanngosy commented Apr 8, 2019

Here is the cluster.yaml I am trying to create:

# An unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 1

# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2

# The initial number of worker nodes to launch in addition to the head
# node. When the cluster is first brought up (or when it is refreshed with a
# subsequent `ray up`) this number of nodes will be started.
initial_workers: 1

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    container_name: "ray-nvidia-docker-test" # e.g. ray_docker

    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
    head_run_options:
      - --runtime=nvidia

    worker_image: "tensorflow/tensorflow:1.13.1-py3"
    worker_run_options: []

# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
# target_utilization is 0.8, it would resize the cluster to 13. This fraction
# can be decreased to increase the aggressiveness of upscaling.
# This value must be less than 1.0 for scaling to happen.
target_utilization_fraction: 0.8

# If a node is idle for this many minutes, it will be removed.
#idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-b
    project_id: robustprothestics # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
head_node:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
    guestAccelerators:
      - acceleratorType: projects/robustprothestics/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
        acceleratorCount: 1
    metadata:
      items:
        - key: install-nvidia-driver
          value: "True"
    scheduling:
      - onHostMaintenance: TERMINATE

    # Additional options can be found in in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
    # Run workers on preemtible instance by default.
    # Comment this out to use on-demand.
    scheduling:
      - preemptible: true
      - onHostMaintenance: TERMINATE

    # Additional options can be found in in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

initialization_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"
# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc

    # Install ray
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --redis-address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Do I have to grant Ray anything related to Docker in IAM?

@guoyuhong
Contributor

This PR introduced a Python lint failure on master.

@hartikainen
Contributor

Sorry, that's my bad. Should be fixed by #4584.

@toanngosy
Contributor

@hartikainen have you encountered the docker problem I had above?

@hartikainen
Contributor

@toanngosy unfortunately I have not seen the error before. I just tested your configuration and everything runs smoothly. Here's the command I ran:

AUTOSCALER_CONFIG_FILE=${RAY_PATH}/python/ray/autoscaler/gcp/example-2.yaml \
  && ray down -y ${AUTOSCALER_CONFIG_FILE} \
  && ray exec ${AUTOSCALER_CONFIG_FILE} \
    --docker \
    'python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' \
    --start

@toanngosy
Contributor

I thought I had messed up the SSH setup and the GCP permissions, so I deleted everything and started fresh. It is working smoothly now. Thanks @hartikainen.

guoyuhong pushed a commit that referenced this pull request Apr 10, 2019
* Lint code that we forgot to lint in previous PR

* Revert setup command merge

* Lint

* Revert "Revert setup command merge"

This reverts commit 55e1cdb.

* Fix testReportsConfigFailures test

* Minor syntax tweaks

* Lint
@stefanpantic stefanpantic deleted the gpu-cpu-docker-cluster branch April 19, 2019 09:20
@stefanpantic stefanpantic restored the gpu-cpu-docker-cluster branch April 19, 2019 09:20
@stefanpantic stefanpantic deleted the gpu-cpu-docker-cluster branch April 19, 2019 09:43
@AdamGleave AdamGleave mentioned this pull request Dec 9, 2019