The Ray API
===========

Starting Ray
------------

There are two main ways in which Ray can be used. First, you can start all of
the relevant Ray processes and shut them all down within the scope of a single
script. Second, you can connect to and use an existing Ray cluster.

Starting and stopping a cluster within a script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

One use case is to start all of the relevant Ray processes when you call
``ray.init`` and shut them down when the script exits. These processes include
local and global schedulers, an object store and an object manager, a Redis
server, and more.

**Note:** this approach is limited to a single machine.

This can be done as follows.

.. code-block:: python

    ray.init()

If there are GPUs available on the machine, you should specify this with the
``num_gpus`` argument. Similarly, you can specify the number of CPUs with
``num_cpus``.

.. code-block:: python

    ray.init(num_cpus=20, num_gpus=2)

By default, Ray will use ``psutil.cpu_count()`` to determine the number of
CPUs, and it will also attempt to determine the number of GPUs automatically.

Instead of thinking about the number of "worker" processes on each node, we
prefer to think in terms of the quantities of CPU and GPU resources on each
node and to provide the illusion of an infinite pool of workers. Tasks are
assigned to workers based on the availability of resources, not on the number
of available worker processes, so as to avoid contention.

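To make the resource-based view concrete, here is a toy admission loop. It is
purely illustrative: ``schedule`` and its bookkeeping are invented for this
sketch and are not part of the Ray API or its scheduler.

.. code-block:: python

    # Toy model of resource-based scheduling: a task is admitted when its
    # CPU and GPU demands fit the node's remaining capacity, regardless of
    # how many worker processes exist. Illustrative only, not Ray internals.
    def schedule(tasks, num_cpus, num_gpus):
        running, waiting = [], []
        free_cpus, free_gpus = num_cpus, num_gpus
        for name, cpus, gpus in tasks:
            if cpus <= free_cpus and gpus <= free_gpus:
                free_cpus -= cpus
                free_gpus -= gpus
                running.append(name)
            else:
                waiting.append(name)  # runs later, once resources free up
        return running, waiting

    # Two tasks fit on an 8-CPU, 1-GPU node; the second GPU task must wait,
    # even though more worker processes could always be started.
    running, waiting = schedule(
        [("train_a", 4, 1), ("preprocess", 2, 0), ("train_b", 4, 1)],
        num_cpus=8, num_gpus=1)
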
Connecting to an existing cluster
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once a Ray cluster has been started, the only thing you need in order to
connect to it is the address of the Redis server in the cluster. In this case,
your script will not start up or shut down any processes, and the cluster and
all of its processes may be shared between multiple scripts and multiple
users. Connecting can be done with a command like the following.

.. code-block:: python

    ray.init(redis_address="12.345.67.89:6379")

In this case, you cannot specify ``num_cpus`` or ``num_gpus`` in ``ray.init``
because that information is passed into the cluster when the cluster is
started, not when your script is started.

View the instructions for how to `start a Ray cluster`_ on multiple nodes.

.. _`start a Ray cluster`: http://ray.readthedocs.io/en/latest/using-ray-on-a-cluster.html

.. autofunction:: ray.init

Defining remote functions
-------------------------

Remote functions are used to create tasks. To define a remote function, the
``@ray.remote`` decorator is placed over the function definition.

The function can then be invoked with ``f.remote``. Invoking the function
creates a **task**, which will be scheduled on and executed by some worker
process in the Ray cluster. The call will return an **object ID** (essentially
a future) representing the eventual return value of the task. Anyone with the
object ID can retrieve its value, regardless of where the task was executed
(see `Getting values from object IDs`_).

When a task executes, its outputs will be serialized into a string of bytes
and stored in the object store.

Note that arguments to remote functions can be values or object IDs.

.. code-block:: python

    @ray.remote
    def f(x):
        return x + 1

    x_id = f.remote(0)
    ray.get(x_id)  # 1

    y_id = f.remote(x_id)
    ray.get(y_id)  # 2

If you want a remote function to return multiple object IDs, you can do that
by passing the ``num_return_vals`` argument into the remote decorator.

.. code-block:: python

    @ray.remote(num_return_vals=2)
    def f():
        return 1, 2

    x_id, y_id = f.remote()
    ray.get(x_id)  # 1
    ray.get(y_id)  # 2

.. autofunction:: ray.remote

Getting values from object IDs
------------------------------

Object IDs can be converted into objects by calling ``ray.get`` on the object
ID. Note that ``ray.get`` accepts either a single object ID or a list of
object IDs.

.. code-block:: python

    @ray.remote
    def f():
        return {'key1': ['value']}

    # Get one object ID.
    ray.get(f.remote())  # {'key1': ['value']}

    # Get a list of object IDs.
    ray.get([f.remote() for _ in range(2)])  # [{'key1': ['value']}, {'key1': ['value']}]

Numpy arrays
~~~~~~~~~~~~

Numpy arrays are handled more efficiently than other data types, so **use
numpy arrays whenever possible**.

Any numpy arrays that are part of the serialized object will not be copied out
of the object store. They will remain in the object store, and the resulting
deserialized object will simply have a pointer to the relevant place in the
object store's memory.

Since objects in the object store are immutable, this means that if you want
to mutate a numpy array that was returned by a remote function, you will have
to first copy it.

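As a sketch of the copy-before-mutate rule, the snippet below uses NumPy's
``writeable`` flag to stand in for the object store's immutability, so that it
runs without a cluster; a real read-only array would come from ``ray.get``.

.. code-block:: python

    import numpy as np

    # Stand-in for an array returned by ray.get: deserialized arrays point
    # into the object store's shared memory and are read-only. NumPy's
    # writeable flag simulates that here.
    result = np.zeros(3)
    result.flags.writeable = False

    # Writing to `result` directly raises
    # "ValueError: assignment destination is read-only",
    # so take a private, writable copy before mutating.
    mutable = result.copy()
    mutable[0] = 1.0
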
.. autofunction:: ray.get

Putting objects in the object store
-----------------------------------

The primary way that objects are placed in the object store is by being
returned by a task. However, it is also possible to place objects in the
object store directly using ``ray.put``.

.. code-block:: python

    x_id = ray.put(1)
    ray.get(x_id)  # 1

The main reason to use ``ray.put`` is that you want to pass the same large
object into a number of tasks. By first calling ``ray.put`` and then passing
the resulting object ID into each of the tasks, the large object is copied
into the object store only once, whereas when we pass the object in directly,
it is copied multiple times.

.. code-block:: python

    import numpy as np

    @ray.remote
    def f(x):
        pass

    x = np.zeros(10 ** 6)

    # Alternative 1: Here, x is copied into the object store 10 times.
    [f.remote(x) for _ in range(10)]

    # Alternative 2: Here, x is copied into the object store once.
    x_id = ray.put(x)
    [f.remote(x_id) for _ in range(10)]

Note that ``ray.put`` is called under the hood in a couple of situations.

- It is called on the values returned by a task.
- It is called on the arguments to a task, unless the arguments are Python
  primitives like integers or short strings, lists, tuples, or dictionaries.

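To see the copy counts concretely, here is a toy model of the two alternatives
above. ``ToyObjectStore`` is invented for this sketch; it is not Ray's
implementation, which handles serialization and shared memory for you.

.. code-block:: python

    # Toy object store that counts how many copies of a value it holds.
    # Illustrative only; real storage and sharing are done by Ray.
    class ToyObjectStore:
        def __init__(self):
            self._objects = []

        def put(self, value):
            # Each put stores one (conceptual) copy and returns its ID.
            self._objects.append(value)
            return len(self._objects) - 1

        def num_copies(self):
            return len(self._objects)

    big = list(range(1000))

    # Alternative 1: passing the value itself into 10 tasks triggers an
    # implicit put, i.e. one stored copy, per task.
    per_task = ToyObjectStore()
    for _ in range(10):
        per_task.put(big)

    # Alternative 2: put once up front and hand the tiny ID to every task.
    shared = ToyObjectStore()
    big_id = shared.put(big)
    task_args = [big_id] * 10
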
.. autofunction:: ray.put

Waiting for a subset of tasks to finish
---------------------------------------

It is often desirable to adapt the computation being done based on when
different tasks finish. For example, if a bunch of tasks each take a variable
length of time, and their results can be processed in any order, then it makes
sense to simply process the results in the order that they finish. In other
settings, it makes sense to discard straggler tasks whose results may not be
needed.

To do this, we introduce the ``ray.wait`` primitive, which takes a list of
object IDs and returns when a subset of them are available. By default it
blocks until a single object is available, but the ``num_returns`` value can
be specified to wait for a different number. If a ``timeout`` argument is
passed in, it will block for at most that many milliseconds and may return a
list with fewer than ``num_returns`` elements.

The ``ray.wait`` function returns two lists. The first list is a list of
object IDs of available objects (of length at most ``num_returns``), and the
second list is a list of the remaining object IDs, so the combination of
these two lists is equal to the list passed in to ``ray.wait`` (up to
ordering).

.. code-block:: python

    import time

    @ray.remote
    def f(n):
        time.sleep(n)
        return n

    # Start 3 tasks with different durations.
    results = [f.remote(i) for i in range(3)]
    # Block until 2 of them have finished.
    ready_ids, remaining_ids = ray.wait(results, num_returns=2)

    # Start 5 tasks with different durations.
    results = [f.remote(i) for i in range(5)]
    # Block until 4 of them have finished or 2.5 seconds pass.
    ready_ids, remaining_ids = ray.wait(results, num_returns=4, timeout=2500)

It is easy to use this construct to create an infinite loop in which multiple
tasks are executing, and whenever one task finishes, a new one is launched.

.. code-block:: python

    @ray.remote
    def f():
        return 1

    # Start 5 tasks.
    remaining_ids = [f.remote() for i in range(5)]
    # Whenever one task finishes, start a new one.
    for _ in range(100):
        ready_ids, remaining_ids = ray.wait(remaining_ids)
        # Get the available object and do something with it.
        print(ray.get(ready_ids))
        # Start a new task.
        remaining_ids.append(f.remote())

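For readers familiar with the standard library, this process-as-they-finish
pattern can be sketched as a ``num_returns``-aware wrapper around
``concurrent.futures.wait``. This is an analogy only: ``ray.wait`` operates on
object IDs rather than futures and takes its timeout in milliseconds, and the
``wait_for`` helper below is invented for the sketch.

.. code-block:: python

    import concurrent.futures
    import time

    def f(n):
        time.sleep(n)
        return n

    def wait_for(futures, num_returns=1):
        # Block until `num_returns` futures are done, returning
        # (ready, remaining), roughly as ray.wait does for object IDs.
        ready, remaining = [], set(futures)
        while remaining and len(ready) < num_returns:
            done, remaining = concurrent.futures.wait(
                remaining, return_when=concurrent.futures.FIRST_COMPLETED)
            for fut in done:
                if len(ready) < num_returns:
                    ready.append(fut)
                else:
                    remaining.add(fut)  # finished, but over the quota
        return ready, list(remaining)

    with concurrent.futures.ThreadPoolExecutor() as pool:
        # Start 5 tasks with different durations, then block until 2 finish.
        futures = [pool.submit(f, 0.02 * i) for i in range(5)]
        ready, remaining = wait_for(futures, num_returns=2)
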
.. autofunction:: ray.wait

.. autofunction:: ray.get_gpu_ids

.. autofunction:: ray.get_resource_ids

Viewing errors
--------------

Keeping track of errors that occur in different processes throughout a cluster
can be challenging. There are a couple of mechanisms to help with this.

1. If a task throws an exception, that exception will be printed in the
   background of the driver process.

2. If ``ray.get`` is called on an object ID whose parent task threw an
   exception before creating the object, the exception will be re-raised by
   ``ray.get``.

The errors will also be accumulated in Redis and can be accessed with
``ray.error_info``. Normally, you shouldn't need to do this, but it is
possible.

.. code-block:: python

    @ray.remote
    def f():
        raise Exception("This task failed!!")

    f.remote()  # An error message will be printed in the background.

    # Wait for the error to propagate to Redis.
    import time
    time.sleep(1)

    ray.error_info()  # This returns a list containing the error message.

.. autofunction:: ray.error_info

.. autofunction:: ray.get_webui_url

.. autofunction:: ray.shutdown

.. autofunction:: ray.register_custom_serializer

.. autofunction:: ray.profile

.. autofunction:: ray.method

The Ray Command Line API
------------------------

.. click:: ray.scripts.scripts:start
   :prog: ray start
   :show-nested:

.. click:: ray.scripts.scripts:stop
   :prog: ray stop
   :show-nested:

.. click:: ray.scripts.scripts:create_or_update
   :prog: ray create_or_update
   :show-nested:

.. click:: ray.scripts.scripts:teardown
   :prog: ray teardown
   :show-nested:

.. click:: ray.scripts.scripts:get_head_ip
   :prog: ray get_head_ip
   :show-nested: