ray exec and ray attach commands (ray-project#2560)
ray exec CLUSTER CMD [--screen] [--start] [--stop]
ray attach CLUSTER [--start]

Example:
ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop

This will, in one command, create a cluster and run the given command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers are terminated and the head node is stopped.
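For example, reattaching later to the screen started above is a single command (a minimal sketch reusing the same cluster config):

ray attach sgd.yaml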
ericl authored Aug 15, 2018
1 parent 53f9755 commit 079c4e4
Showing 10 changed files with 311 additions and 64 deletions.
12 changes: 10 additions & 2 deletions doc/source/api.rst
@@ -37,11 +37,19 @@ The Ray Command Line API
   :show-nested:

.. click:: ray.scripts.scripts:create_or_update
-   :prog: ray create_or_update
+   :prog: ray up
   :show-nested:

.. click:: ray.scripts.scripts:teardown
-   :prog: ray teardown
+   :prog: ray down
   :show-nested:

+.. click:: ray.scripts.scripts:exec_cmd
+   :prog: ray exec
+   :show-nested:
+
+.. click:: ray.scripts.scripts:attach
+   :prog: ray attach
+   :show-nested:
+
.. click:: ray.scripts.scripts:get_head_ip
60 changes: 48 additions & 12 deletions doc/source/autoscaling.rst
@@ -1,7 +1,7 @@
Cloud Setup and Auto-Scaling
============================

-The ``ray create_or_update`` command starts an AWS or GCP Ray cluster from your personal computer. Once the cluster is up, you can then SSH into it to run Ray programs.
+The ``ray up`` command starts or updates an AWS or GCP Ray cluster from your personal computer. Once the cluster is up, you can then SSH into it to run Ray programs.

Quick start (AWS)
-----------------
@@ -18,14 +18,14 @@ SSH into the head node, ``source activate tensorflow_p36``, and then run Ray pro
    # Create or update the cluster. When the command finishes, it will print
    # out the command that can be used to SSH into the cluster head node.
-    $ ray create_or_update ray/python/ray/autoscaler/aws/example-full.yaml
+    $ ray up ray/python/ray/autoscaler/aws/example-full.yaml

    # Reconfigure autoscaling behavior without interrupting running jobs
-    $ ray create_or_update ray/python/ray/autoscaler/aws/example-full.yaml \
+    $ ray up ray/python/ray/autoscaler/aws/example-full.yaml \
        --max-workers=N --no-restart

    # Teardown the cluster
-    $ ray teardown ray/python/ray/autoscaler/aws/example-full.yaml
+    $ ray down ray/python/ray/autoscaler/aws/example-full.yaml

Quick start (GCP)
-----------------
@@ -41,14 +41,50 @@ SSH into the head node and then run Ray programs with ``ray.init(redis_address="
    # Create or update the cluster. When the command finishes, it will print
    # out the command that can be used to SSH into the cluster head node.
-    $ ray create_or_update ray/python/ray/autoscaler/gcp/example-full.yaml
+    $ ray up ray/python/ray/autoscaler/gcp/example-full.yaml

    # Reconfigure autoscaling behavior without interrupting running jobs
-    $ ray create_or_update ray/python/ray/autoscaler/gcp/example-full.yaml \
+    $ ray up ray/python/ray/autoscaler/gcp/example-full.yaml \
        --max-workers=N --no-restart

    # Teardown the cluster
-    $ ray teardown ray/python/ray/autoscaler/gcp/example-full.yaml
+    $ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
+Running commands on new and existing clusters
+---------------------------------------------
+
+You can use ``ray exec`` to conveniently run commands on clusters. Note that scripts you run should connect to Ray via ``ray.init(redis_address="localhost:6379")``.
+
+.. code-block:: bash
+
+    # Run a command on the cluster
+    $ ray exec cluster.yaml 'echo "hello world"'
+
+    # Run a command on the cluster, starting it if needed
+    $ ray exec cluster.yaml 'echo "hello world"' --start
+
+    # Run a command on the cluster, stopping the cluster after it finishes
+    $ ray exec cluster.yaml 'echo "hello world"' --stop
+
+    # Run a command on a new cluster called 'experiment-1', stopping it after
+    $ ray exec cluster.yaml 'echo "hello world"' \
+        --start --stop --cluster-name experiment-1
+
+    # Run a command in a screen (experimental)
+    $ ray exec cluster.yaml 'echo "hello world"' --screen
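As a concrete sketch of the note above: ``train.py`` here is a hypothetical script that itself calls ``ray.init(redis_address="localhost:6379")``, so it joins the running cluster rather than starting a local Ray instance.

.. code-block:: bash

    # Run a hypothetical script on the cluster; train.py is assumed to
    # call ray.init(redis_address="localhost:6379") internally
    $ ray exec cluster.yaml 'python train.py'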
+Attaching to the cluster
+------------------------
+
+You can use ``ray attach`` to attach to an interactive console on the cluster.
+
+.. code-block:: bash
+
+    # Open a screen on the cluster
+    $ ray attach cluster.yaml
+
+    # Open a screen on a new cluster called 'session-1'
+    $ ray attach cluster.yaml --start --cluster-name=session-1
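Since ``ray attach`` opens a GNU screen session, the standard ``Ctrl-a d`` binding detaches while leaving the session (and the cluster) running; a sketch of returning to it later:

.. code-block:: bash

    # Reattach to the same screen session; anything left running in it
    # survives a Ctrl-a d detach in between
    $ ray attach cluster.yaml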
Port-forwarding applications
----------------------------
@@ -62,9 +98,9 @@ To run connect to applications running on the cluster (e.g. Jupyter notebook) us
Updating your cluster
---------------------

-When you run ``ray create_or_update`` with an existing cluster, the command checks if the local configuration differs from the applied configuration of the cluster. This includes any changes to synced files specified in the ``file_mounts`` section of the config. If so, the new files and config will be uploaded to the cluster. Following that, Ray services will be restarted.
+When you run ``ray up`` with an existing cluster, the command checks if the local configuration differs from the applied configuration of the cluster. This includes any changes to synced files specified in the ``file_mounts`` section of the config. If so, the new files and config will be uploaded to the cluster. Following that, Ray services will be restarted.

-You can also run ``ray create_or_update`` to restart a cluster if it seems to be in a bad state (this will restart all Ray services even if there are no config changes).
+You can also run ``ray up`` to restart a cluster if it seems to be in a bad state (this will restart all Ray services even if there are no config changes).

If you don't want the update to restart services (e.g. because the changes don't require a restart), pass ``--no-restart`` to the update call.
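For example, to push updated ``file_mounts`` without restarting Ray services (a sketch reusing the quick-start config):

.. code-block:: bash

    # Upload changed files and config, but skip the service restart
    $ ray up ray/python/ray/autoscaler/aws/example-full.yaml --no-restart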

@@ -115,11 +151,11 @@ A common use case is syncing a particular local git branch to all workers of the
    - test -e <REPO_NAME> || git clone https://github.com/<REPO_ORG>/<REPO_NAME>.git
    - cd <REPO_NAME> && git fetch && git checkout `cat /tmp/current_branch_sha`

-This tells ``ray create_or_update`` to sync the current git branch SHA from your personal computer to a temporary file on the cluster (assuming you've pushed the branch head already). Then, the setup commands read that file to figure out which SHA they should checkout on the nodes. Note that each command runs in its own session. The final workflow to update the cluster then becomes just this:
+This tells ``ray up`` to sync the current git branch SHA from your personal computer to a temporary file on the cluster (assuming you've pushed the branch head already). Then, the setup commands read that file to figure out which SHA they should checkout on the nodes. Note that each command runs in its own session. The final workflow to update the cluster then becomes just this:

1. Make local changes to a git branch
2. Commit the changes with ``git commit`` and ``git push``
-3. Update files on your Ray cluster with ``ray create_or_update``
+3. Update files on your Ray cluster with ``ray up``
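Concretely, that three-step loop might look like the following sketch (the commit message and ``cluster.yaml`` config name are hypothetical):

.. code-block:: bash

    # 1. make local changes on your branch, then
    # 2. commit and push so the cluster can fetch the new branch head
    $ git commit -am "update training script"
    $ git push
    # 3. re-sync the branch SHA and files to the cluster
    $ ray up cluster.yaml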

Common cluster configurations
-----------------------------
@@ -174,7 +210,7 @@ with GPU worker nodes instead.

.. code-block:: yaml

-    min_workers: 0
+    min_workers: 1  # must have at least 1 GPU worker (issue #2106)
    max_workers: 10
    head_node:
        InstanceType: m4.large
8 changes: 8 additions & 0 deletions doc/source/rllib.rst
@@ -73,3 +73,11 @@ Package Reference
* `ray.rllib.models <rllib-package-ref.html#module-ray.rllib.models>`__
* `ray.rllib.optimizers <rllib-package-ref.html#module-ray.rllib.optimizers>`__
* `ray.rllib.utils <rllib-package-ref.html#module-ray.rllib.utils>`__
+
+Troubleshooting
+---------------
+
+If you encounter errors like
+`blas_thread_init: pthread_create: Resource temporarily unavailable` when using many workers,
+try setting ``OMP_NUM_THREADS=1``. Similarly, check configured system limits with
+`ulimit -a` for other resource limit errors.
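A short sketch of both suggestions from a shell, where ``train.py`` stands in for whatever script hits the limit:

.. code-block:: bash

    # Cap the number of threads BLAS spawns in each worker
    $ OMP_NUM_THREADS=1 python train.py

    # Inspect configured limits (e.g. max user processes) when chasing
    # other resource limit errors
    $ ulimit -a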
6 changes: 3 additions & 3 deletions doc/source/using-ray-on-a-cluster.rst
@@ -1,9 +1,9 @@
-Using Ray on a Cluster
-======================
+Manual Cluster Setup
+====================

.. note::

-    If you're using AWS you can use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.
+    If you're using AWS or GCP you should use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.

The instructions in this document work well for small clusters. For larger
clusters, follow the instructions for `managing a cluster with parallel ssh`_.
6 changes: 3 additions & 3 deletions doc/source/using-ray-on-a-large-cluster.rst
@@ -1,9 +1,9 @@
-Using Ray on a Large Cluster
-============================
+Manual Cluster Setup on a Large Cluster
+=======================================

.. note::

-    If you're using AWS you can use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.
+    If you're using AWS or GCP you should use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.

Deploying Ray on a cluster requires a bit of manual work. The instructions here
illustrate how to use parallel ssh commands to simplify the process of running
14 changes: 12 additions & 2 deletions python/ray/autoscaler/autoscaler.py
@@ -188,6 +188,9 @@ def _info(self):
                    max_frac = frac
            nodes_used += max_frac
        idle_times = [now - t for t in self.last_used_time_by_ip.values()]
+        heartbeat_times = [
+            now - t for t in self.last_heartbeat_time_by_ip.values()
+        ]
        return {
            "ResourceUsage": ", ".join([
                "{}/{} {}".format(
@@ -201,6 +204,10 @@
                int(np.min(idle_times)) if idle_times else -1,
                int(np.mean(idle_times)) if idle_times else -1,
                int(np.max(idle_times)) if idle_times else -1),
+            "TimeSinceLastHeartbeat": "Min={} Mean={} Max={}".format(
+                int(np.min(heartbeat_times)) if heartbeat_times else -1,
+                int(np.mean(heartbeat_times)) if heartbeat_times else -1,
+                int(np.max(heartbeat_times)) if heartbeat_times else -1),
        }

@@ -504,14 +511,17 @@ def update_if_needed(self, node_id):
            return
        if self.files_up_to_date(node_id):
            return
-        if self.config.get("no_restart", False) and \
-                self.num_successful_updates.get(node_id, 0) > 0:
+        successful_updated = self.num_successful_updates.get(node_id, 0) > 0
+        if successful_updated and self.config.get("restart_only", False):
+            init_commands = self.config["worker_start_ray_commands"]
+        elif successful_updated and self.config.get("no_restart", False):
            init_commands = (self.config["setup_commands"] +
                             self.config["worker_setup_commands"])
        else:
            init_commands = (self.config["setup_commands"] +
                             self.config["worker_setup_commands"] +
                             self.config["worker_start_ray_commands"])

        updater = self.node_updater_cls(
            node_id,
            self.config["provider"],