
[autoscaler] Add support for separate docker containers on head and worker nodes #4537

Merged
merged 19 commits into ray-project:master on Apr 7, 2019

Conversation

stefanpantic
Contributor

What do these changes do?

  • Node-specific Docker parameters can now be specified when running a cluster with Docker containers.
  • Added fields (see the sketch after this list):
    • head_image - image for the head node; defaults to image if not specified
    • worker_image - image for worker nodes; defaults to image if not specified
    • head_run_options - appended to run_options on the head node
    • worker_run_options - appended to run_options on worker nodes
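
For illustration, a minimal docker section combining these fields might look like the sketch below. The image names and container name are placeholders borrowed from the example configs discussed later in this thread, not defaults added by this PR.

docker:
    image: "tensorflow/tensorflow:1.13.1-py3"             # fallback image used when a node-specific image is not given
    container_name: "ray_docker"
    run_options: []                                        # docker run flags shared by all nodes

    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"     # overrides image on the head node
    head_run_options:
        - --runtime=nvidia                                 # appended to run_options on the head node

    worker_image: "tensorflow/tensorflow:1.13.1-py3"       # overrides image on worker nodes
    worker_run_options: []                                 # appended to run_options on worker nodes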

Related issue number

Linter

  • I've run scripts/format.sh to lint the changes in this PR.

@AmplabJenkins

Can one of the admins verify this patch?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13438/
Test FAILed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13442/
Test FAILed.

@richardliaw richardliaw changed the title [ray]Add support for separate docker containers on head and worker nodes in ray auto-scaler [autoscaler] Add support for separate docker containers on head and worker nodes Apr 2, 2019
@richardliaw
Contributor

At a glance this looks good; thanks for the contribution! @hartikainen can you try this out for GCP?

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13464/
Test FAILed.

@stefanpantic
Contributor Author

@richardliaw I've resolved the discussion; could you take another look at the PR? 😄

@richardliaw
Contributor

Just tried running this - I think this breaks some part of the autoscaler. I tried running ray up example-gpu-docker.yaml on both master and this branch, and this branch fails with:

2019-04-04 01:07:44,704	ERROR updater.py:140 -- NodeUpdater: i-0f16fdcf21b889cbb: Error updating (Exit Status 1) ssh -i /Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=5m ubuntu@34.217.214.200 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl'
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 143, in run
    raise e
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 132, in run
    self.do_update()
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 228, in do_update
    self.ssh_cmd(cmd)
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 291, in ssh_cmd
    stderr=redirect or sys.stderr)
  File "/Users/rliaw/miniconda3/envs/ray/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/Users/rliaw/.ssh/ray-autoscaler_us-west-2.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@34.217.214.200', "bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl'"]' returned non-zero exit status 1.

2019-04-04 01:07:44,800	ERROR commands.py:260 -- get_or_create_head_node: Updating 34.217.214.200 failed

Note that the pip call is not executed within the Docker container (but it should be). Can you take a look?

@stefanpantic
Contributor Author

Could you post the entire startup log?

@stefanpantic
Contributor Author

@richardliaw Try it now; it should be fixed. Also, I think the wheel in the pip install step should be updated to one that includes these changes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13618/
Test FAILed.

@toanngosy
Contributor

toanngosy commented Apr 7, 2019

@stefanpantic I got the same problem while initializing a cluster on GCP. How do I resolve it?

2019-04-07 19:32:28,366	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:33,583	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:38,781	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:43,982	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-07 19:32:46,609	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Got SSH [LogTimer=134342ms]
2019-04-07 19:32:47,609	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554658367002-585f41bb655f3-6e875e6a-daafae19 to finish...
2019-04-07 19:32:54,568	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554658367002-585f41bb655f3-6e875e6a-daafae19 finished.
2019-04-07 19:32:54,569	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Syncing /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem...
2019-04-07 19:32:54,569	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-07 19:32:56,312	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-78448c6a: Synced /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem [LogTimer=1743ms]
2019-04-07 19:32:56,313	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Syncing /tmp/ray-bootstrap-0xyjtzqz to ~/ray_bootstrap_config.yaml...
2019-04-07 19:32:56,313	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-07 19:32:58,087	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-78448c6a: Synced /tmp/ray-bootstrap-0xyjtzqz to ~/ray_bootstrap_config.yaml [LogTimer=1774ms]
2019-04-07 19:33:02,973	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554658378463-585f41c653975-eac137dd-d211063a to finish...
2019-04-07 19:33:03,660	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554658378463-585f41c653975-eac137dd-d211063a finished.
2019-04-07 19:33:03,660	INFO updater.py:268 -- NodeUpdater: Running timeout 300 bash -c "
   command -v nvidia-smi && nvidia-smi
   until [ \$? -eq 0 ]; do
       command -v nvidia-smi && nvidia-smi
   done" on 35.247.101.31...
/usr/bin/nvidia-smi
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Sun Apr  7 17:33:04 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   60C    P0    87W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                              
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2019-04-07 19:33:06,037	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Initialization commands completed [LogTimer=2376ms]
2019-04-07 19:33:06,038	INFO updater.py:268 -- NodeUpdater: Running docker inspect -f '{{.State.Running}}' ray-nvidia-docker || docker run --rm --name ray-nvidia-docker -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash on 35.247.101.31...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
2019-04-07 19:33:06,536	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Setup commands completed [LogTimer=498ms]
2019-04-07 19:33:06,537	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Applied config 8faf70d6bd22a9d01be381c24e34f89affc0457f [LogTimer=157004ms]
2019-04-07 19:33:06,537	ERROR updater.py:140 -- NodeUpdater: ray-gpu-docker-head-78448c6a: Error updating (Exit Status 125) ssh -i /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=5m ubuntu@35.247.101.31 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' ray-nvidia-docker || docker run --rm --name ray-nvidia-docker -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash'
2019-04-07 19:33:07,529	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554658386835-585f41ce4f8ce-e7c4771b-c6d5ef01 to finish...
2019-04-07 19:33:13,615	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554658386835-585f41ce4f8ce-e7c4771b-c6d5ef01 finished.
Exception in thread Thread-1:
Traceback (most recent call last):
 File "/home/toanngo/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
   self.run()
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 143, in run
   raise e
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 132, in run
   self.do_update()
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 228, in do_update
   self.ssh_cmd(cmd)
 File "/home/toanngo/anaconda3/lib/python3.6/site-packages/ray/autoscaler/updater.py", line 291, in ssh_cmd
   stderr=redirect or sys.stderr)
 File "/home/toanngo/anaconda3/lib/python3.6/subprocess.py", line 291, in check_call
   raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@35.247.101.31', 'bash --login -c -i \'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f \'"\'"\'{{.State.Running}}\'"\'"\' ray-nvidia-docker || docker run --rm --name ray-nvidia-docker -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.12.0-gpu-py3 bash\'']' returned non-zero exit status 125.

2019-04-07 19:33:13,899	ERROR commands.py:258 -- get_or_create_head_node: Updating 35.247.101.31 failed

@richardliaw
Contributor

Awesome - this works on AWS. I used the following diff, applied to example-gpu-docker.yaml:

diff --git a/python/ray/autoscaler/aws/example-gpu-docker.yaml b/python/ray/autoscaler/aws/example-gpu-docker.yaml
index 962685390..af636022f 100644
--- a/python/ray/autoscaler/aws/example-gpu-docker.yaml
+++ b/python/ray/autoscaler/aws/example-gpu-docker.yaml
@@ -3,7 +3,7 @@ cluster_name: gpu-docker

 # The minimum number of workers nodes to launch in addition to the head
 # node. This number should be >= 0.
-min_workers: 0
+min_workers: 1

 # The maximum number of workers nodes to launch in addition to the head
 # node. This takes precedence over min_workers.
@@ -18,18 +18,18 @@ initial_workers: 0
 # and opens all the necessary ports to support the Ray cluster.
 # Empty string means disabled.
 docker:
-    image: "tensorflow/tensorflow:1.12.0-gpu-py3"
+    # image: "tensorflow/tensorflow:1.12.0-py3"
     container_name: "ray-nvidia-docker-test" # e.g. ray_docker
-    run_options:
-      - --runtime=nvidia
+    #run_options:
+    #  - --runtime=nvidia

     # # Example of running a GPU head with CPU workers
-    # head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
-    # head_run_options:
-    #     - --runtime=nvidia
+    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
+    head_run_options:
+        - --runtime=nvidia

-    # worker_image: "tensorflow/tensorflow:1.13.1-py3"
-    # worker_run_options: []
+    worker_image: "tensorflow/tensorflow:1.13.1-py3"
+    worker_run_options: []
 # The autoscaler will scale up the cluster to this target fraction of resource
 # usage. For example, if a cluster of 10 nodes is 100% busy and
@@ -63,7 +63,7 @@ auth:
 # For more documentation on available fields, see:
 # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
 head_node:
-    InstanceType: p2.xlarge
+    InstanceType: g3.4xlarge
     ImageId: ami-0b294f219d14e6a82 # Deep Learning AMI (Ubuntu) Version 21.0

     # You can provision additional disk space with a conf as follows
@@ -96,12 +96,14 @@ worker_nodes:
 file_mounts: {
 #    "/path1/on/remote/machine": "/path1/on/local/machine",
 #    "/path2/on/remote/machine": "/path2/on/local/machine",
+     "/home/ubuntu/ray_code/": "/Users/rliaw/Research/riselab/ray/python/ray/"
 }

 # List of shell commands to run to set up nodes.
 setup_commands:
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
     - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
+    - /home/ubuntu/ray_code/rllib/setup-rllib-dev.py --yes
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl

 # Custom commands that will be run on the head node after common setup.

@hartikainen can you try this out and then merge if this works?

@hartikainen
Contributor

hartikainen commented Apr 7, 2019

I tested this with the following config on GCP:

diff --git a/python/ray/autoscaler/gcp/example-gpu-docker.yaml b/python/ray/autoscaler/gcp/example-gpu-docker.yaml
index fa1face51f81..ae777845506a 100644
--- a/python/ray/autoscaler/gcp/example-gpu-docker.yaml
+++ b/python/ray/autoscaler/gcp/example-gpu-docker.yaml
@@ -3,7 +3,7 @@ cluster_name: gpu-docker
 
 # The minimum number of workers nodes to launch in addition to the head
 # node. This number should be >= 0.
-min_workers: 0
+min_workers: 1
 
 # The maximum number of workers nodes to launch in addition to the head
 # node. This takes precedence over min_workers.
@@ -18,11 +18,17 @@ initial_workers: 0
 # and opens all the necessary ports to support the Ray cluster.
 # Empty string means disabled.
 docker:
-    image: "tensorflow/tensorflow:1.12.0-gpu-py3"
+    # image: "tensorflow/tensorflow:1.12.0-gpu-py3"
     container_name: "ray-nvidia-docker-test" # e.g. ray_docker
-    run_options:
-      - --runtime=nvidia
+    # run_options:
+    #   - --runtime=nvidia
 
+    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
+    head_run_options:
+        - --runtime=nvidia
+
+    worker_image: "tensorflow/tensorflow:1.13.1-py3"
+    worker_run_options: []
 
 # The autoscaler will scale up the cluster to this target fraction of resource
 # usage. For example, if a cluster of 10 nodes is 100% busy and
@@ -39,7 +45,7 @@ provider:
     type: gcp
     region: us-west1
     availability_zone: us-west1-b
-    project_id: <project_id> # Globally unique project id
+    project_id: project_id # Globally unique project id
 
 # How Ray will authenticate with newly launched nodes.
 auth:
@@ -65,7 +71,7 @@ head_node:
           # See https://cloud.google.com/compute/docs/images for more images
           sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
     guestAccelerators:
-      - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
+      - acceleratorType: projects/project_id/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
         acceleratorCount: 1
     metadata:
       items:
@@ -87,13 +93,6 @@ worker_nodes:
           diskSizeGb: 50
           # See https://cloud.google.com/compute/docs/images for more images
           sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
-    guestAccelerators:
-      - acceleratorType: projects/<project_id>/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
-        acceleratorCount: 1
-    metadata:
-      items:
-        - key: install-nvidia-driver
-          value: "True"
     # Run workers on preemtible instance by default.
     # Comment this out to use on-demand.
     scheduling:
@@ -108,6 +107,7 @@ worker_nodes:
 file_mounts: {
 #    "/path1/on/remote/machine": "/path1/on/local/machine",
 #    "/path2/on/remote/machine": "/path2/on/local/machine",
+    "/home/ubuntu/ray_code/": "~/github/hartikainen/ray/python/ray/"
 }
 
 initialization_commands:
@@ -129,6 +129,7 @@ setup_commands:
     # Install ray
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
     - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
+    - /home/ubuntu/ray_code/rllib/setup-rllib-dev.py --yes
     # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl
 
 # Custom commands that will be run on the head node after common setup.

Works fine. I'll merge this.

@hartikainen hartikainen merged commit 9154869 into ray-project:master Apr 7, 2019
@toanngosy
Contributor

@hartikainen when I try to run it again, I still get the problem

bash: no job control in this shell

then

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.

Can you think of any error I might have made?

@richardliaw
Contributor

Can you post the full stack trace?

@stefanpantic
Contributor Author

You can safely ignore the bash warning; it's irrelevant.

@toanngosy
Contributor

Here is the full trace, @richard4912. Thanks.

ray up cluster.yaml -y                                                   
2019-04-08 10:36:17,038	INFO commands.py:189 -- get_or_create_head_node: Launching new head node...
2019-04-08 10:36:18,632	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712577098-58600bae2c474-c0ff746f-881157eb to finish...
2019-04-08 10:36:52,368	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712577098-58600bae2c474-c0ff746f-881157eb finished.
2019-04-08 10:36:52,633	INFO commands.py:202 -- get_or_create_head_node: Updating files on head node...
2019-04-08 10:36:52,637	INFO updater.py:128 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Updating to 861d207745c2b9e999e1d6b20a2909416607a519
2019-04-08 10:36:53,739	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712613043-58600bd073dad-8d7d3da2-a2621f59 to finish...
2019-04-08 10:36:59,441	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712613043-58600bd073dad-8d7d3da2-a2621f59 finished.
2019-04-08 10:36:59,441	INFO updater.py:90 -- NodeUpdater: Waiting for IP of ray-gpu-docker-head-01ccc301...
2019-04-08 10:36:59,441	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Got IP [LogTimer=0ms]
2019-04-08 10:36:59,454	INFO updater.py:155 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Waiting for SSH...
2019-04-08 10:36:59,455	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:04,744	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:09,959	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:15,200	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:20,467	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:25,672	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:30,872	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:36,096	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:41,309	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:46,556	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:51,780	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:37:57,002	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:02,228	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:07,452	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:12,671	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:17,876	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:23,094	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:28,313	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:33,525	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:38,751	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:43,972	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:49,196	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:54,416	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:38:59,620	INFO updater.py:268 -- NodeUpdater: Running uptime on 35.247.101.31...
2019-04-08 10:39:02,462	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Got SSH [LogTimer=123007ms]
2019-04-08 10:39:03,544	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712742801-58600c4c32fd6-4f8acd1a-c1fef099 to finish...
2019-04-08 10:39:09,690	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712742801-58600c4c32fd6-4f8acd1a-c1fef099 finished.
2019-04-08 10:39:09,691	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Syncing /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem...
2019-04-08 10:39:09,692	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-08 10:39:11,492	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-01ccc301: Synced /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem to ~/ray_bootstrap_key.pem [LogTimer=1800ms]
2019-04-08 10:39:11,493	INFO updater.py:198 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Syncing /tmp/ray-bootstrap-fmcaxfby to ~/ray_bootstrap_config.yaml...
2019-04-08 10:39:11,494	INFO updater.py:268 -- NodeUpdater: Running mkdir -p ~ on 35.247.101.31...
2019-04-08 10:39:13,255	INFO log_timer.py:21 -- NodeUpdater ray-gpu-docker-head-01ccc301: Synced /tmp/ray-bootstrap-fmcaxfby to ~/ray_bootstrap_config.yaml [LogTimer=1761ms]
2019-04-08 10:39:14,226	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712753584-58600c567b938-c2d59239-7dac8e71 to finish...
2019-04-08 10:39:20,133	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712753584-58600c567b938-c2d59239-7dac8e71 finished.
2019-04-08 10:39:20,133	INFO updater.py:268 -- NodeUpdater: Running timeout 300 bash -c "
    command -v nvidia-smi && nvidia-smi
    until [ \$? -eq 0 ]; do
        command -v nvidia-smi && nvidia-smi
    done" on 35.247.101.31...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
/usr/bin/nvidia-smi
Mon Apr  8 08:39:21 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   63C    P0    86W / 149W |      0MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
2019-04-08 10:39:22,592	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Initialization commands completed [LogTimer=2458ms]
2019-04-08 10:39:22,592	INFO updater.py:268 -- NodeUpdater: Running docker inspect -f '{{.State.Running}}' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.13.1-gpu-py3 bash on 35.247.101.31...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
2019-04-08 10:39:23,085	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Setup commands completed [LogTimer=493ms]
2019-04-08 10:39:23,086	INFO log_timer.py:21 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Applied config 861d207745c2b9e999e1d6b20a2909416607a519 [LogTimer=150449ms]
2019-04-08 10:39:23,087	ERROR updater.py:140 -- NodeUpdater: ray-gpu-docker-head-01ccc301: Error updating (Exit Status 125) ssh -i /home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem -o ConnectTimeout=120s -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPath=/tmp/ray_ssh_sockets/%C -o ControlPersist=5m ubuntu@35.247.101.31 bash --login -c -i 'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f '"'"'{{.State.Running}}'"'"' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.13.1-gpu-py3 bash'
2019-04-08 10:39:24,174	INFO node_provider.py:26 -- wait_for_compute_zone_operation: Waiting for operation operation-1554712763442-58600c5fe2559-39e99e1d-cda7b998 to finish...
2019-04-08 10:39:30,272	INFO node_provider.py:37 -- wait_for_compute_zone_operation: Operation operation-1554712763442-58600c5fe2559-39e99e1d-cda7b998 finished.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/toanngo/anaconda3/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 143, in run
    raise e
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 132, in run
    self.do_update()
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 228, in do_update
    self.ssh_cmd(cmd)
  File "/home/toanngo/Documents/GitHub/ray/python/ray/autoscaler/updater.py", line 291, in ssh_cmd
    stderr=redirect or sys.stderr)
  File "/home/toanngo/anaconda3/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['ssh', '-i', '/home/toanngo/.ssh/ray-autoscaler_gcp_us-west1_robustprothestics_ubuntu.pem', '-o', 'ConnectTimeout=120s', '-o', 'StrictHostKeyChecking=no', '-o', 'ControlMaster=auto', '-o', 'ControlPath=/tmp/ray_ssh_sockets/%C', '-o', 'ControlPersist=5m', 'ubuntu@35.247.101.31', 'bash --login -c -i \'true && source ~/.bashrc && export OMP_NUM_THREADS=1 PYTHONWARNINGS=ignore && docker inspect -f \'"\'"\'{{.State.Running}}\'"\'"\' ray-nvidia-docker-test || docker run --rm --name ray-nvidia-docker-test -d -it -p 6379:6379 -p 8076:8076 -p 4321:4321  -e LC_ALL=C.UTF-8 -e LANG=C.UTF-8 --runtime=nvidia --net=host tensorflow/tensorflow:1.13.1-gpu-py3 bash\'']' returned non-zero exit status 125.

2019-04-08 10:39:30,522	ERROR commands.py:258 -- get_or_create_head_node: Updating 35.247.101.31 failed

@toanngosy
Contributor

Sorry, I missed a line:

bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell

Maybe I have a problem with Docker. After the above, I ran ray up cluster.yaml again to update the cluster and these bash warnings showed up multiple times.

@stefanpantic
Contributor Author

Yes, they always come in pairs; you can safely ignore them. You can read about that here. I think there might be a problem with your Docker daemon. Could you post the output of

ps -ef | grep docker

@toanngosy
Contributor

ubuntu@ray-gpu-docker-head-a9e49086:~$ ps -ef | grep docker
root     31839     1  0 08:54 ?        00:00:00 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock
ubuntu   32011 32007  0 08:56 pts/1    00:00:00 grep docker

This is the result, executed on the server.

@stefanpantic
Contributor Author

The user that you are running as may not have permissions to talk to /var/run/docker.sock on that system.
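
If that is the cause, one possible workaround (just a sketch, not something this PR adds) is to prepare the Docker socket permissions from the cluster YAML's initialization_commands, which the autoscaler runs on the host before any docker command; note that the group change normally only takes effect for new login sessions:

initialization_commands:
    # Hypothetical workaround: allow the non-root SSH user (ubuntu, per the auth section) to talk to the Docker daemon.
    - sudo usermod -aG docker ubuntu
    # Make sure the daemon is actually up before the autoscaler issues docker inspect / docker run.
    - sudo systemctl restart docker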

@toanngosy
Contributor

I changed docker.py to use sudo docker ... but it still cannot connect to the daemon. What should I do?

@stefanpantic
Contributor Author

Could you send me the cluster.yaml you are trying to create?

@toanngosy
Contributor

toanngosy commented Apr 8, 2019

Here is the cluster.yaml I am trying to create:

# An unique identifier for the head node and workers of this cluster.
cluster_name: gpu-docker

# The minimum number of workers nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 1

# The maximum number of workers nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2

# The initial number of worker nodes to launch in addition to the head
# node. When the cluster is first brought up (or when it is refreshed with a
# subsequent `ray up`) this number of nodes will be started.
initial_workers: 1

# This executes all commands on all nodes in the docker container,
# and opens all the necessary ports to support the Ray cluster.
# Empty string means disabled.
docker:
    container_name: "ray-nvidia-docker-test" # e.g. ray_docker

    head_image: "tensorflow/tensorflow:1.13.1-gpu-py3"
    head_run_options:
      - --runtime=nvidia

    worker_image: "tensorflow/tensorflow:1.13.1-py3"
    worker_run_options: []

# The autoscaler will scale up the cluster to this target fraction of resource
# usage. For example, if a cluster of 10 nodes is 100% busy and
# target_utilization is 0.8, it would resize the cluster to 13. This fraction
# can be decreased to increase the aggressiveness of upscaling.
# This value must be less than 1.0 for scaling to happen.
target_utilization_fraction: 0.8

# If a node is idle for this many minutes, it will be removed.
#idle_timeout_minutes: 5

# Cloud-provider specific configuration.
provider:
    type: gcp
    region: us-west1
    availability_zone: us-west1-b
    project_id: robustprothestics # Globally unique project id

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu
# By default Ray creates a new private keypair, but you can also use your own.
# If you do so, make sure to also set "KeyName" in the head and worker node
# configurations below. This requires that you have added the key into the
# project wide meta-data.
#    ssh_private_key: /path/to/your/key.pem

# Provider-specific config for the head node, e.g. instance type. By default
# Ray will auto-configure unspecified fields such as subnets and ssh-keys.
# For more documentation on available fields, see:
# https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
head_node:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
    guestAccelerators:
      - acceleratorType: projects/robustprothestics/zones/us-west1-b/acceleratorTypes/nvidia-tesla-k80
        acceleratorCount: 1
    metadata:
      items:
        - key: install-nvidia-driver
          value: "True"
    scheduling:
      - onHostMaintenance: TERMINATE

    # Additional options can be found in in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/tf-latest-gpu
    # Run workers on preemtible instance by default.
    # Comment this out to use on-demand.
    scheduling:
      - preemptible: true
      - onHostMaintenance: TERMINATE

    # Additional options can be found in in the compute docs at
    # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

initialization_commands:
    # Wait until nvidia drivers are installed
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"
# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc

    # Install ray
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp27-cp27mu-manylinux1_x86_64.whl
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
    # - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp36-cp36m-manylinux1_x86_64.whl

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install google-api-python-client==1.7.8

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --redis-address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

Do I have to grant Ray anything related to Docker in IAM?

@guoyuhong
Contributor

This PR introduced a Python lint failure on master.

@hartikainen
Contributor

Sorry, that's my bad. Should be fixed by #4584.

@toanngosy
Contributor

@hartikainen have you encountered the docker problem I had above?

@hartikainen
Contributor

@toanngosy unfortunately I have not seen the error before. I just tested your configuration and everything runs smoothly. Here's the command I ran:

AUTOSCALER_CONFIG_FILE=${RAY_PATH}/python/ray/autoscaler/gcp/example-2.yaml \
  && ray down -y ${AUTOSCALER_CONFIG_FILE} \
  && ray exec ${AUTOSCALER_CONFIG_FILE} \
    --docker \
    'python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"' \
    --start

@toanngosy
Contributor

I thought I had messed up the SSH setup and the GCP permissions, so I deleted everything and started fresh. It is working smoothly now. Thanks @hartikainen.

guoyuhong pushed a commit that referenced this pull request Apr 10, 2019
* Lint code that we forgot to lint in previous PR

* Revert setup command merge

* Lint

* Revert "Revert setup command merge"

This reverts commit 55e1cdb.

* Fix testReportsConfigFailures test

* Minor syntax tweaks

* Lint
@stefanpantic stefanpantic deleted the gpu-cpu-docker-cluster branch April 19, 2019 09:20
@stefanpantic stefanpantic restored the gpu-cpu-docker-cluster branch April 19, 2019 09:20
@stefanpantic stefanpantic deleted the gpu-cpu-docker-cluster branch April 19, 2019 09:43
@AdamGleave AdamGleave mentioned this pull request Dec 9, 2019