ray exec and ray attach commands (ray-project#2560)
ray exec CLUSTER CMD [--screen] [--start] [--stop]
ray attach CLUSTER [--start]

Example:
ray exec sgd.yaml 'source activate tensorflow_p27 && cd ~/ray/python/ray/rllib && ./train.py --run=PPO --env=CartPole-v0' --screen --start --stop

This will, in one command, create a cluster and run the given command on it in a screen session. The screen can later be attached to via ray attach. After the command finishes, the cluster workers are terminated and the head node is stopped.
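For example, reattaching later to the screen started above is a single command (a minimal sketch reusing the same cluster config):

ray attach sgd.yaml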
ericl authored Aug 15, 2018
1 parent 53f9755 commit 079c4e4
Showing 10 changed files with 311 additions and 64 deletions.
12 changes: 10 additions & 2 deletions doc/source/api.rst
@@ -37,11 +37,19 @@ The Ray Command Line API
   :show-nested:

.. click:: ray.scripts.scripts:create_or_update
-   :prog: ray create_or_update
+   :prog: ray up
   :show-nested:

.. click:: ray.scripts.scripts:teardown
-   :prog: ray teardown
+   :prog: ray down
   :show-nested:

+.. click:: ray.scripts.scripts:exec_cmd
+   :prog: ray exec
+   :show-nested:
+
+.. click:: ray.scripts.scripts:attach
+   :prog: ray attach
+   :show-nested:
+
.. click:: ray.scripts.scripts:get_head_ip
60 changes: 48 additions & 12 deletions doc/source/autoscaling.rst
@@ -1,7 +1,7 @@
Cloud Setup and Auto-Scaling
============================

-The ``ray create_or_update`` command starts an AWS or GCP Ray cluster from your personal computer. Once the cluster is up, you can then SSH into it to run Ray programs.
+The ``ray up`` command starts or updates an AWS or GCP Ray cluster from your personal computer. Once the cluster is up, you can then SSH into it to run Ray programs.

Quick start (AWS)
-----------------
@@ -18,14 +18,14 @@ SSH into the head node, ``source activate tensorflow_p36``, and then run Ray pro
    # Create or update the cluster. When the command finishes, it will print
    # out the command that can be used to SSH into the cluster head node.
-    $ ray create_or_update ray/python/ray/autoscaler/aws/example-full.yaml
+    $ ray up ray/python/ray/autoscaler/aws/example-full.yaml

    # Reconfigure autoscaling behavior without interrupting running jobs
-    $ ray create_or_update ray/python/ray/autoscaler/aws/example-full.yaml \
+    $ ray up ray/python/ray/autoscaler/aws/example-full.yaml \
        --max-workers=N --no-restart

    # Teardown the cluster
-    $ ray teardown ray/python/ray/autoscaler/aws/example-full.yaml
+    $ ray down ray/python/ray/autoscaler/aws/example-full.yaml

Quick start (GCP)
-----------------
@@ -41,14 +41,50 @@ SSH into the head node and then run Ray programs with ``ray.init(redis_address="
    # Create or update the cluster. When the command finishes, it will print
    # out the command that can be used to SSH into the cluster head node.
-    $ ray create_or_update ray/python/ray/autoscaler/gcp/example-full.yaml
+    $ ray up ray/python/ray/autoscaler/gcp/example-full.yaml

    # Reconfigure autoscaling behavior without interrupting running jobs
-    $ ray create_or_update ray/python/ray/autoscaler/gcp/example-full.yaml \
+    $ ray up ray/python/ray/autoscaler/gcp/example-full.yaml \
        --max-workers=N --no-restart

    # Teardown the cluster
-    $ ray teardown ray/python/ray/autoscaler/gcp/example-full.yaml
+    $ ray down ray/python/ray/autoscaler/gcp/example-full.yaml
+Running commands on new and existing clusters
+---------------------------------------------
+
+You can use ``ray exec`` to conveniently run commands on clusters. Note that scripts you run should connect to Ray via ``ray.init(redis_address="localhost:6379")``.
+
+.. code-block:: bash
+
+    # Run a command on the cluster
+    $ ray exec cluster.yaml 'echo "hello world"'
+
+    # Run a command on the cluster, starting it if needed
+    $ ray exec cluster.yaml 'echo "hello world"' --start
+
+    # Run a command on the cluster, stopping the cluster after it finishes
+    $ ray exec cluster.yaml 'echo "hello world"' --stop
+
+    # Run a command on a new cluster called 'experiment-1', stopping it after
+    $ ray exec cluster.yaml 'echo "hello world"' \
+        --start --stop --cluster-name experiment-1
+
+    # Run a command in a screen (experimental)
+    $ ray exec cluster.yaml 'echo "hello world"' --screen
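As a concrete sketch of the note above: ``train.py`` here is a hypothetical script that itself calls ``ray.init(redis_address="localhost:6379")``, so it joins the running cluster rather than starting a local Ray instance.

.. code-block:: bash

    # Run a hypothetical script on the cluster; train.py is assumed to
    # call ray.init(redis_address="localhost:6379") internally
    $ ray exec cluster.yaml 'python train.py'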
+Attaching to the cluster
+------------------------
+
+You can use ``ray attach`` to attach to an interactive console on the cluster.
+
+.. code-block:: bash
+
+    # Open a screen on the cluster
+    $ ray attach cluster.yaml
+
+    # Open a screen on a new cluster called 'session-1'
+    $ ray attach cluster.yaml --start --cluster-name=session-1
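Since ``ray attach`` opens a GNU screen session, the standard ``Ctrl-a d`` binding detaches while leaving the session (and the cluster) running; a sketch of returning to it later:

.. code-block:: bash

    # Reattach to the same screen session; anything left running in it
    # survives a Ctrl-a d detach in between
    $ ray attach cluster.yaml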
Port-forwarding applications
----------------------------
@@ -62,9 +98,9 @@ To run connect to applications running on the cluster (e.g. Jupyter notebook) us
Updating your cluster
---------------------

-When you run ``ray create_or_update`` with an existing cluster, the command checks if the local configuration differs from the applied configuration of the cluster. This includes any changes to synced files specified in the ``file_mounts`` section of the config. If so, the new files and config will be uploaded to the cluster. Following that, Ray services will be restarted.
+When you run ``ray up`` with an existing cluster, the command checks if the local configuration differs from the applied configuration of the cluster. This includes any changes to synced files specified in the ``file_mounts`` section of the config. If so, the new files and config will be uploaded to the cluster. Following that, Ray services will be restarted.

-You can also run ``ray create_or_update`` to restart a cluster if it seems to be in a bad state (this will restart all Ray services even if there are no config changes).
+You can also run ``ray up`` to restart a cluster if it seems to be in a bad state (this will restart all Ray services even if there are no config changes).

If you don't want the update to restart services (e.g. because the changes don't require a restart), pass ``--no-restart`` to the update call.
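For example, to push updated ``file_mounts`` without restarting Ray services (a sketch reusing the quick-start config):

.. code-block:: bash

    # Upload changed files and config, but skip the service restart
    $ ray up ray/python/ray/autoscaler/aws/example-full.yaml --no-restart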

@@ -115,11 +151,11 @@ A common use case is syncing a particular local git branch to all workers of the
    - test -e <REPO_NAME> || git clone https://github.com/<REPO_ORG>/<REPO_NAME>.git
    - cd <REPO_NAME> && git fetch && git checkout `cat /tmp/current_branch_sha`

-This tells ``ray create_or_update`` to sync the current git branch SHA from your personal computer to a temporary file on the cluster (assuming you've pushed the branch head already). Then, the setup commands read that file to figure out which SHA they should checkout on the nodes. Note that each command runs in its own session. The final workflow to update the cluster then becomes just this:
+This tells ``ray up`` to sync the current git branch SHA from your personal computer to a temporary file on the cluster (assuming you've pushed the branch head already). Then, the setup commands read that file to figure out which SHA they should checkout on the nodes. Note that each command runs in its own session. The final workflow to update the cluster then becomes just this:

1. Make local changes to a git branch
2. Commit the changes with ``git commit`` and ``git push``
-3. Update files on your Ray cluster with ``ray create_or_update``
+3. Update files on your Ray cluster with ``ray up``
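Concretely, that three-step loop might look like the following sketch (the commit message and ``cluster.yaml`` config name are hypothetical):

.. code-block:: bash

    # 1. make local changes on your branch, then
    # 2. commit and push so the cluster can fetch the new branch head
    $ git commit -am "update training script"
    $ git push
    # 3. re-sync the branch SHA and files to the cluster
    $ ray up cluster.yaml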

Common cluster configurations
-----------------------------
@@ -174,7 +210,7 @@ with GPU worker nodes instead.

.. code-block:: yaml

-    min_workers: 0
+    min_workers: 1  # must have at least 1 GPU worker (issue #2106)
    max_workers: 10
    head_node:
        InstanceType: m4.large
8 changes: 8 additions & 0 deletions doc/source/rllib.rst
@@ -73,3 +73,11 @@ Package Reference
* `ray.rllib.models <rllib-package-ref.html#module-ray.rllib.models>`__
* `ray.rllib.optimizers <rllib-package-ref.html#module-ray.rllib.optimizers>`__
* `ray.rllib.utils <rllib-package-ref.html#module-ray.rllib.utils>`__
+
+Troubleshooting
+---------------
+
+If you encounter errors like
+`blas_thread_init: pthread_create: Resource temporarily unavailable` when using many workers,
+try setting ``OMP_NUM_THREADS=1``. Similarly, check configured system limits with
+`ulimit -a` for other resource limit errors.
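A short sketch of both suggestions from a shell, where ``train.py`` stands in for whatever script hits the limit:

.. code-block:: bash

    # Cap the number of threads BLAS spawns in each worker
    $ OMP_NUM_THREADS=1 python train.py

    # Inspect configured limits (e.g. max user processes) when chasing
    # other resource limit errors
    $ ulimit -a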
6 changes: 3 additions & 3 deletions doc/source/using-ray-on-a-cluster.rst
@@ -1,9 +1,9 @@
-Using Ray on a Cluster
-======================
+Manual Cluster Setup
+====================

.. note::

-    If you're using AWS you can use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.
+    If you're using AWS or GCP you should use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.

The instructions in this document work well for small clusters. For larger
clusters, follow the instructions for `managing a cluster with parallel ssh`_.
6 changes: 3 additions & 3 deletions doc/source/using-ray-on-a-large-cluster.rst
@@ -1,9 +1,9 @@
-Using Ray on a Large Cluster
-============================
+Manual Cluster Setup on a Large Cluster
+=======================================

.. note::

-    If you're using AWS you can use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.
+    If you're using AWS or GCP you should use the automated `setup commands <http://ray.readthedocs.io/en/latest/autoscaling.html>`__.

Deploying Ray on a cluster requires a bit of manual work. The instructions here
illustrate how to use parallel ssh commands to simplify the process of running
14 changes: 12 additions & 2 deletions python/ray/autoscaler/autoscaler.py
@@ -188,6 +188,9 @@ def _info(self):
                    max_frac = frac
            nodes_used += max_frac
        idle_times = [now - t for t in self.last_used_time_by_ip.values()]
+        heartbeat_times = [
+            now - t for t in self.last_heartbeat_time_by_ip.values()
+        ]
        return {
            "ResourceUsage": ", ".join([
                "{}/{} {}".format(
@@ -201,6 +204,10 @@
                int(np.min(idle_times)) if idle_times else -1,
                int(np.mean(idle_times)) if idle_times else -1,
                int(np.max(idle_times)) if idle_times else -1),
+            "TimeSinceLastHeartbeat": "Min={} Mean={} Max={}".format(
+                int(np.min(heartbeat_times)) if heartbeat_times else -1,
+                int(np.mean(heartbeat_times)) if heartbeat_times else -1,
+                int(np.max(heartbeat_times)) if heartbeat_times else -1),
        }

@@ -504,14 +511,17 @@ def update_if_needed(self, node_id):
            return
        if self.files_up_to_date(node_id):
            return
-        if self.config.get("no_restart", False) and \
-                self.num_successful_updates.get(node_id, 0) > 0:
+        successful_updated = self.num_successful_updates.get(node_id, 0) > 0
+        if successful_updated and self.config.get("restart_only", False):
+            init_commands = self.config["worker_start_ray_commands"]
+        elif successful_updated and self.config.get("no_restart", False):
            init_commands = (self.config["setup_commands"] +
                             self.config["worker_setup_commands"])
        else:
            init_commands = (self.config["setup_commands"] +
                             self.config["worker_setup_commands"] +
                             self.config["worker_start_ray_commands"])

        updater = self.node_updater_cls(
            node_id,
            self.config["provider"],