
Commit 515da77

robertnishihara authored and pcmoritz committed
Change ray.worker.cleanup -> ray.shutdown and improve API documentation. (ray-project#2374)
* Change ray.worker.cleanup -> ray.shutdown and improve API documentation.
* Deprecate ray.worker.cleanup() gracefully.
* Fix linting
1 parent b316afe commit 515da77

30 files changed: +279 −404 lines changed
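
The commit message notes that ``ray.worker.cleanup()`` is deprecated gracefully rather than removed outright. The relevant ``ray/worker.py`` hunk is not among the files shown below, so the following is only a minimal sketch of what such a shim typically looks like; the warning text and the stand-in ``shutdown()`` body are assumptions, not code from this commit.

.. code-block:: python

  import warnings


  def shutdown():
      """Stand-in for the real ray.shutdown(), which stops the processes
      started by ray.init()."""
      print("shutting down")


  def cleanup():
      """Hypothetical backward-compatible alias for ray.worker.cleanup()."""
      warnings.warn(
          "ray.worker.cleanup() is deprecated; use ray.shutdown() instead.",
          DeprecationWarning,
          stacklevel=2)
      shutdown()


  cleanup()  # Emits a DeprecationWarning, then delegates to shutdown().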

doc/requirements-doc.txt

+1

@@ -11,5 +11,6 @@ psutil
 recommonmark
 redis
 sphinx
+sphinx-click
 sphinx_rtd_theme
 pandas

doc/source/api.rst

+25 −260

@@ -1,284 +1,49 @@
 The Ray API
 ===========
 
-Starting Ray
-------------
-
-There are two main ways in which Ray can be used. First, you can start all of
-the relevant Ray processes and shut them all down within the scope of a single
-script. Second, you can connect to and use an existing Ray cluster.
-
-Starting and stopping a cluster within a script
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-One use case is to start all of the relevant Ray processes when you call
-``ray.init`` and shut them down when the script exits. These processes include
-local and global schedulers, an object store and an object manager, a redis
-server, and more.
-
-**Note:** this approach is limited to a single machine.
-
-This can be done as follows.
-
-.. code-block:: python
-
-  ray.init()
-
-If there are GPUs available on the machine, you should specify this with the
-``num_gpus`` argument. Similarly, you can also specify the number of CPUs with
-``num_cpus``.
-
-.. code-block:: python
-
-  ray.init(num_cpus=20, num_gpus=2)
-
-By default, Ray will use ``psutil.cpu_count()`` to determine the number of CPUs.
-Ray will also attempt to automatically determine the number of GPUs.
-
-Instead of thinking about the number of "worker" processes on each node, we
-prefer to think in terms of the quantities of CPU and GPU resources on each
-node and to provide the illusion of an infinite pool of workers. Tasks will be
-assigned to workers based on the availability of resources so as to avoid
-contention and not based on the number of available worker processes.
-
-Connecting to an existing cluster
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Once a Ray cluster has been started, the only thing you need in order to connect
-to it is the address of the Redis server in the cluster. In this case, your
-script will not start up or shut down any processes. The cluster and all of its
-processes may be shared between multiple scripts and multiple users. To do this,
-you simply need to know the address of the cluster's Redis server. This can be
-done with a command like the following.
-
-.. code-block:: python
-
-  ray.init(redis_address="12.345.67.89:6379")
-
-In this case, you cannot specify ``num_cpus`` or ``num_gpus`` in ``ray.init``
-because that information is passed into the cluster when the cluster is started,
-not when your script is started.
-
-View the instructions for how to `start a Ray cluster`_ on multiple nodes.
-
-.. _`start a Ray cluster`: http://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html
-
 .. autofunction:: ray.init
 
-Defining remote functions
--------------------------
-
-Remote functions are used to create tasks. To define a remote function, the
-``@ray.remote`` decorator is placed over the function definition.
-
-The function can then be invoked with ``f.remote``. Invoking the function
-creates a **task** which will be scheduled on and executed by some worker
-process in the Ray cluster. The call will return an **object ID** (essentially a
-future) representing the eventual return value of the task. Anyone with the
-object ID can retrieve its value, regardless of where the task was executed (see
-`Getting values from object IDs`_).
-
-When a task executes, its outputs will be serialized into a string of bytes and
-stored in the object store.
-
-Note that arguments to remote functions can be values or object IDs.
-
-.. code-block:: python
-
-  @ray.remote
-  def f(x):
-      return x + 1
-
-  x_id = f.remote(0)
-  ray.get(x_id)  # 1
-
-  y_id = f.remote(x_id)
-  ray.get(y_id)  # 2
-
-If you want a remote function to return multiple object IDs, you can do that by
-passing the ``num_return_vals`` argument into the remote decorator.
-
-.. code-block:: python
-
-  @ray.remote(num_return_vals=2)
-  def f():
-      return 1, 2
-
-  x_id, y_id = f.remote()
-  ray.get(x_id)  # 1
-  ray.get(y_id)  # 2
-
 .. autofunction:: ray.remote
 
-Getting values from object IDs
-------------------------------
-
-Object IDs can be converted into objects by calling ``ray.get`` on the object
-ID. Note that ``ray.get`` accepts either a single object ID or a list of object
-IDs.
-
-.. code-block:: python
-
-  @ray.remote
-  def f():
-      return {'key1': ['value']}
-
-  # Get one object ID.
-  ray.get(f.remote())  # {'key1': ['value']}
-
-  # Get a list of object IDs.
-  ray.get([f.remote() for _ in range(2)])  # [{'key1': ['value']}, {'key1': ['value']}]
-
-Numpy arrays
-~~~~~~~~~~~~
-
-Numpy arrays are handled more efficiently than other data types, so **use numpy
-arrays whenever possible**.
-
-Any numpy arrays that are part of the serialized object will not be copied out
-of the object store. They will remain in the object store and the resulting
-deserialized object will simply have a pointer to the relevant place in the
-object store's memory.
-
-Since objects in the object store are immutable, this means that if you want to
-mutate a numpy array that was returned by a remote function, you will have to
-first copy it.
-
 .. autofunction:: ray.get
 
-Putting objects in the object store
------------------------------------
-
-The primary way that objects are placed in the object store is by being returned
-by a task. However, it is also possible to directly place objects in the object
-store using ``ray.put``.
-
-.. code-block:: python
-
-  x_id = ray.put(1)
-  ray.get(x_id)  # 1
-
-The main reason to use ``ray.put`` is that you want to pass the same large
-object into a number of tasks. By first doing ``ray.put`` and then passing the
-resulting object ID into each of the tasks, the large object is copied into the
-object store only once, whereas when we directly pass the object in, it is
-copied multiple times.
-
-.. code-block:: python
-
-  import numpy as np
-
-  @ray.remote
-  def f(x):
-      pass
-
-  x = np.zeros(10 ** 6)
-
-  # Alternative 1: Here, x is copied into the object store 10 times.
-  [f.remote(x) for _ in range(10)]
-
-  # Alternative 2: Here, x is copied into the object store once.
-  x_id = ray.put(x)
-  [f.remote(x_id) for _ in range(10)]
-
-Note that ``ray.put`` is called under the hood in a couple situations.
-
-- It is called on the values returned by a task.
-- It is called on the arguments to a task, unless the arguments are Python
-  primitives like integers or short strings, lists, tuples, or dictionaries.
+.. autofunction:: ray.wait
 
 .. autofunction:: ray.put
 
-Waiting for a subset of tasks to finish
----------------------------------------
-
-It is often desirable to adapt the computation being done based on when
-different tasks finish. For example, if a bunch of tasks each take a variable
-length of time, and their results can be processed in any order, then it makes
-sense to simply process the results in the order that they finish. In other
-settings, it makes sense to discard straggler tasks whose results may not be
-needed.
-
-To do this, we introduce the ``ray.wait`` primitive, which takes a list of
-object IDs and returns when a subset of them are available. By default it blocks
-until a single object is available, but the ``num_returns`` value can be
-specified to wait for a different number. If a ``timeout`` argument is passed
-in, it will block for at most that many milliseconds and may return a list with
-fewer than ``num_returns`` elements.
-
-The ``ray.wait`` function returns two lists. The first list is a list of object
-IDs of available objects (of length at most ``num_returns``), and the second
-list is a list of the remaining object IDs, so the combination of these two
-lists is equal to the list passed in to ``ray.wait`` (up to ordering).
-
-.. code-block:: python
-
-  import time
-  import numpy as np
+.. autofunction:: ray.get_gpu_ids
 
-  @ray.remote
-  def f(n):
-      time.sleep(n)
-      return n
-
-  # Start 3 tasks with different durations.
-  results = [f.remote(i) for i in range(3)]
-  # Block until 2 of them have finished.
-  ready_ids, remaining_ids = ray.wait(results, num_returns=2)
-
-  # Start 5 tasks with different durations.
-  results = [f.remote(i) for i in range(5)]
-  # Block until 4 of them have finished or 2.5 seconds pass.
-  ready_ids, remaining_ids = ray.wait(results, num_returns=4, timeout=2500)
-
-It is easy to use this construct to create an infinite loop in which multiple
-tasks are executing, and whenever one task finishes, a new one is launched.
-
-.. code-block:: python
-
-  @ray.remote
-  def f():
-      return 1
-
-  # Start 5 tasks.
-  remaining_ids = [f.remote() for i in range(5)]
-  # Whenever one task finishes, start a new one.
-  for _ in range(100):
-      ready_ids, remaining_ids = ray.wait(remaining_ids)
-      # Get the available object and do something with it.
-      print(ray.get(ready_ids))
-      # Start a new task.
-      remaining_ids.append(f.remote())
-
-.. autofunction:: ray.wait
+.. autofunction:: ray.get_resource_ids
 
-Viewing errors
---------------
+.. autofunction:: ray.get_webui_url
 
-Keeping track of errors that occur in different processes throughout a cluster
-can be challenging. There are a couple mechanisms to help with this.
+.. autofunction:: ray.shutdown
 
-1. If a task throws an exception, that exception will be printed in the
-   background of the driver process.
+.. autofunction:: ray.register_custom_serializer
 
-2. If ``ray.get`` is called on an object ID whose parent task threw an exception
-   before creating the object, the exception will be re-raised by ``ray.get``.
+.. autofunction:: ray.profile
 
-The errors will also be accumulated in Redis and can be accessed with
-``ray.error_info``. Normally, you shouldn't need to do this, but it is possible.
+.. autofunction:: ray.method
 
-.. code-block:: python
+The Ray Command Line API
+------------------------
 
-  @ray.remote
-  def f():
-      raise Exception("This task failed!!")
+.. click:: ray.scripts.scripts:start
+   :prog: ray start
+   :show-nested:
 
-  f.remote()  # An error message will be printed in the background.
+.. click:: ray.scripts.scripts:stop
+   :prog: ray stop
+   :show-nested:
 
-  # Wait for the error to propagate to Redis.
-  import time
-  time.sleep(1)
+.. click:: ray.scripts.scripts:create_or_update
+   :prog: ray create_or_update
+   :show-nested:
 
-  ray.error_info()  # This returns a list containing the error message.
+.. click:: ray.scripts.scripts:teardown
+   :prog: ray teardown
+   :show-nested:
 
-.. autofunction:: ray.error_info
+.. click:: ray.scripts.scripts:get_head_ip
+   :prog: ray get_head_ip
+   :show-nested:
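
For orientation, here is a short usage sketch of the API surface that the trimmed ``api.rst`` now documents through ``autofunction`` directives: ``ray.init()`` to start Ray, a remote function invoked with ``.remote()``, ``ray.get()`` to fetch the result, and the newly exported ``ray.shutdown()`` to tear everything down. It mirrors the examples removed from the prose above and assumes a Ray build that includes this commit.

.. code-block:: python

  import ray

  # Start the Ray processes for this script.
  ray.init(num_cpus=2)


  @ray.remote
  def f(x):
      return x + 1


  # Invoking a remote function returns an object ID; ray.get fetches the value.
  x_id = f.remote(0)
  print(ray.get(x_id))  # 1

  # New in this commit: ray.shutdown() replaces ray.worker.cleanup() as the
  # public way to stop the processes started by ray.init().
  ray.shutdown()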

doc/source/conf.py

+1

@@ -72,6 +72,7 @@
 extensions = [
     'sphinx.ext.autodoc',
     'sphinx.ext.napoleon',
+    'sphinx_click.ext',
 ]
 
 # Add any paths that contain templates here, relative to this directory.
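
The ``sphinx_click.ext`` extension added here is what lets the new ``.. click::`` directives in ``api.rst`` render help pages for the click commands defined in ``ray.scripts.scripts``. Below is a small illustrative stub showing the kind of object such a directive documents; the command name, option, and help text are made up for the example, not copied from Ray's CLI.

.. code-block:: python

  import click


  @click.command()
  @click.option("--num-cpus", default=None, type=int,
                help="Number of CPUs to make available (illustrative only).")
  def start(num_cpus):
      """Start Ray processes on this machine (documentation stub)."""
      click.echo("Would start Ray with num_cpus={}".format(num_cpus))


  if __name__ == "__main__":
      start()

A directive such as ``.. click:: mymodule:start`` with ``:prog: ray start`` would then pull the docstring and option help into the built documentation.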

python/ray/__init__.py

+5 −4

@@ -50,8 +50,8 @@
 from ray.worker import (error_info, init, connect, disconnect, get, put, wait,
                         remote, profile, flush_profile_data, get_gpu_ids,
                         get_resource_ids, get_webui_url,
-                        register_custom_serializer)  # noqa: E402
-from ray.worker import (SCRIPT_MODE, WORKER_MODE, PYTHON_MODE,
+                        register_custom_serializer, shutdown)  # noqa: E402
+from ray.worker import (SCRIPT_MODE, WORKER_MODE, LOCAL_MODE,
                         SILENT_MODE)  # noqa: E402
 from ray.worker import global_state  # noqa: E402
 # We import ray.actor because some code is run in actor.py which initializes
@@ -67,8 +67,9 @@
     "error_info", "init", "connect", "disconnect", "get", "put", "wait",
     "remote", "profile", "flush_profile_data", "actor", "method",
     "get_gpu_ids", "get_resource_ids", "get_webui_url",
-    "register_custom_serializer", "SCRIPT_MODE", "WORKER_MODE", "PYTHON_MODE",
-    "SILENT_MODE", "global_state", "ObjectID", "_config", "__version__"
+    "register_custom_serializer", "shutdown", "SCRIPT_MODE", "WORKER_MODE",
+    "LOCAL_MODE", "SILENT_MODE", "global_state", "ObjectID", "_config",
+    "__version__"
 ]
 
 import ctypes  # noqa: E402
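
Because ``shutdown`` is now re-exported from ``ray.worker`` and listed in ``__all__``, user code can call it directly as ``ray.shutdown()``; the same hunk also renames the ``PYTHON_MODE`` constant to ``LOCAL_MODE``. A quick check, assuming an installed Ray that contains this change:

.. code-block:: python

  import ray

  # "shutdown" is exported at the package level and included in __all__,
  # so "from ray import *" picks it up as well.
  assert "shutdown" in ray.__all__
  assert callable(ray.shutdown)

  # The old PYTHON_MODE constant is now exposed as LOCAL_MODE.
  assert "LOCAL_MODE" in ray.__all__

  ray.init()
  ray.shutdown()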
