EC2 cluster setup scripts and initial version of auto-scaler (ray-pro…
ericl authored and robertnishihara committed Dec 16, 2017
1 parent 76b6b4a commit f5ea443
Showing 20 changed files with 1,665 additions and 16 deletions.
1 change: 1 addition & 0 deletions .travis.yml
@@ -112,6 +112,7 @@ script:
- python test/runtest.py
- python test/array_test.py
- python test/actor_test.py
+ - python test/autoscaler_test.py
- python test/tensorflow_test.py
- python test/failure_test.py
- python test/microbenchmarks.py
8 changes: 4 additions & 4 deletions .travis/install-dependencies.sh
@@ -24,15 +24,15 @@ if [[ "$PYTHON" == "2.7" ]] && [[ "$platform" == "linux" ]]; then
wget https://repo.continuum.io/miniconda/Miniconda2-latest-Linux-x86_64.sh -O miniconda.sh -nv
bash miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"
- pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python
+ pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python pyyaml
elif [[ "$PYTHON" == "3.5" ]] && [[ "$platform" == "linux" ]]; then
sudo apt-get update
sudo apt-get install -y cmake pkg-config python-dev python-numpy build-essential autoconf curl libtool unzip
# Install miniconda.
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh -nv
bash miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"
- pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python
+ pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python pyyaml
elif [[ "$PYTHON" == "2.7" ]] && [[ "$platform" == "macosx" ]]; then
# check that brew is installed
which -s brew
@@ -48,7 +48,7 @@ elif [[ "$PYTHON" == "2.7" ]] && [[ "$platform" == "macosx" ]]; then
wget https://repo.continuum.io/miniconda/Miniconda2-latest-MacOSX-x86_64.sh -O miniconda.sh -nv
bash miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"
- pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python
+ pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python pyyaml
elif [[ "$PYTHON" == "3.5" ]] && [[ "$platform" == "macosx" ]]; then
# check that brew is installed
which -s brew
@@ -64,7 +64,7 @@ elif [[ "$PYTHON" == "3.5" ]] && [[ "$platform" == "macosx" ]]; then
wget https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -O miniconda.sh -nv
bash miniconda.sh -b -p $HOME/miniconda
export PATH="$HOME/miniconda/bin:$PATH"
- pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python
+ pip install -q numpy cloudpickle==0.5.2 cython cmake funcsigs click colorama psutil redis tensorflow gym flatbuffers opencv-python pyyaml
elif [[ "$LINT" == "1" ]]; then
sudo apt-get update
sudo apt-get install -y cmake build-essential autoconf curl libtool unzip
1 change: 1 addition & 0 deletions doc/requirements-doc.txt
@@ -6,6 +6,7 @@ mock
numpy
opencv-python
pyarrow
+ pyyaml
psutil
recommonmark
redis
88 changes: 88 additions & 0 deletions doc/source/autoscaling.rst
@@ -0,0 +1,88 @@
Cluster setup and auto-scaling (Experimental)
=============================================

Quick start
-----------

First, ensure you have configured your AWS credentials in ``~/.aws/credentials``,
as described in `the boto docs <http://boto3.readthedocs.io/en/latest/guide/configuration.html>`__.
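
For reference, a minimal ``~/.aws/credentials`` file looks like the following (the values shown are placeholders; substitute your own keys):

.. code-block:: text

    [default]
    aws_access_key_id = YOUR_ACCESS_KEY_ID
    aws_secret_access_key = YOUR_SECRET_ACCESS_KEY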

Then you're ready to go. The provided `ray/python/ray/autoscaler/aws/example.yaml <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/aws/example.yaml>`__ cluster config file will create a small cluster with an m4.large
head node (on-demand) and two m4.large `spot workers <https://aws.amazon.com/ec2/spot/>`__.
Try it out with these commands:

.. code-block:: bash

    # Create or update the cluster
    $ ray create_or_update ray/python/ray/autoscaler/aws/example.yaml
    # Resize the cluster without interrupting running jobs
    $ ray create_or_update ray/python/ray/autoscaler/aws/example.yaml \
        --max-workers=N --sync-only
    # Teardown the cluster
    $ ray teardown ray/python/ray/autoscaler/aws/example.yaml

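For reference, the contents of the cluster config file look roughly like the following. This is a hypothetical sketch only: fields other than ``min_workers``, ``max_workers``, ``head_node``, and ``worker_nodes`` (for example ``cluster_name``, ``provider``, and ``auth``) are assumptions here, so consult the linked ``example.yaml`` for the authoritative schema.

.. code-block:: yaml

    # Hypothetical sketch of a cluster config; see example.yaml for the real file.
    cluster_name: minimal      # assumed field: used to tag the cluster's nodes
    min_workers: 0
    max_workers: 2
    provider:                  # assumed field: cloud provider settings
        type: aws
        region: us-west-2
    auth:                      # assumed field: SSH login settings
        ssh_user: ubuntu
    head_node:
        InstanceType: m4.large
    worker_nodes:
        InstanceMarketOptions:
            MarketType: spot
        InstanceType: m4.large
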
Common configurations
---------------------

Note: auto-scaling support is not fully implemented yet (targeted for 0.4.0).

The example configuration above is enough to get started with Ray, but for more
compute-intensive workloads you will want to change the instance types, for
example to GPU or larger compute instances, by editing the YAML file. Here are
a few common configurations:

**GPU single node**: use Ray on a single large GPU instance.

.. code-block:: yaml

    max_workers: 0
    head_node:
        InstanceType: p2.8xlarge

**Mixed GPU and CPU nodes**: for RL applications that require proportionally more
CPU than GPU resources, you can use additional CPU workers with a GPU head node.

.. code-block:: yaml

    max_workers: 10
    head_node:
        InstanceType: p2.8xlarge
    worker_nodes:
        InstanceType: m4.16xlarge

**Autoscaling CPU cluster**: use a small head node and have Ray auto-scale
workers as needed. This can be a cost-efficient configuration for clusters with
bursty workloads. You can also request spot workers for additional cost savings.

.. code-block:: yaml

    min_workers: 0
    max_workers: 10
    head_node:
        InstanceType: m4.large
    worker_nodes:
        InstanceMarketOptions:
            MarketType: spot
        InstanceType: m4.16xlarge

**Autoscaling GPU cluster**: similar to the autoscaling CPU cluster, but
with GPU worker nodes instead.

.. code-block:: yaml

    min_workers: 0
    max_workers: 10
    head_node:
        InstanceType: m4.large
    worker_nodes:
        InstanceMarketOptions:
            MarketType: spot
        InstanceType: p2.8xlarge

Additional cloud providers
--------------------------

To use Ray autoscaling on other cloud providers or cluster management systems, you can implement the ``NodeProvider`` interface
(~100 LOC) and register it in `node_provider.py <https://github.com/ray-project/ray/tree/master/python/ray/autoscaler/node_provider.py>`__.
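
As a rough illustration, a custom provider might look like the sketch below. The method names are assumptions about what a node provider typically needs (listing, creating, tagging, and terminating nodes), not the actual interface; consult ``node_provider.py`` for the real method signatures and for how provider types are registered.

.. code-block:: python

    # Hypothetical sketch of a custom NodeProvider; method names are assumptions.
    # See python/ray/autoscaler/node_provider.py for the actual interface.
    class MyCloudNodeProvider(object):
        def __init__(self, provider_config, cluster_name):
            self.provider_config = provider_config
            self.cluster_name = cluster_name

        def nodes(self, tag_filters):
            """Return the ids of all nodes whose tags match tag_filters."""
            raise NotImplementedError

        def is_running(self, node_id):
            """Return whether the given node is up."""
            raise NotImplementedError

        def node_tags(self, node_id):
            """Return the tag dict attached to the given node."""
            raise NotImplementedError

        def external_ip(self, node_id):
            """Return the IP address used to reach the given node."""
            raise NotImplementedError

        def create_node(self, node_config, tags, count):
            """Launch `count` new nodes with the given config and tags."""
            raise NotImplementedError

        def set_node_tags(self, node_id, tags):
            """Update the tags attached to the given node."""
            raise NotImplementedError

        def terminate_node(self, node_id):
            """Shut down the given node."""
            raise NotImplementedError

Registering the new provider would then amount to mapping a provider ``type`` string (for example ``mycloud``) to this class in ``node_provider.py``.
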
1 change: 1 addition & 0 deletions doc/source/index.rst
@@ -84,6 +84,7 @@ Example Program
:maxdepth: 1
:caption: Cluster Usage

+ autoscaling.rst
using-ray-on-a-cluster.rst
using-ray-on-a-large-cluster.rst
using-ray-and-docker-on-a-cluster.md