Commit

Add section on troubleshooting to the documentation. (ray-project#583)
* Add section on troubleshooting to the documentation.

* Address comments.

* Update file descriptor troubleshooting.
robertnishihara authored and pcmoritz committed May 22, 2017
1 parent 06241da commit bc8b0db
Showing 3 changed files with 128 additions and 20 deletions.
6 changes: 6 additions & 0 deletions doc/source/index.rst
@@ -47,3 +47,9 @@ Ray
using-ray-on-a-cluster.rst
using-ray-on-a-large-cluster.rst
using-ray-and-docker-on-a-cluster.md

.. toctree::
:maxdepth: 1
:caption: Help

troubleshooting.rst
122 changes: 122 additions & 0 deletions doc/source/troubleshooting.rst
@@ -0,0 +1,122 @@
Troubleshooting
===============

This document discusses some common problems that people run into when using Ray
as well as some known problems. If you encounter other problems, please
`let us know`_.

.. _`let us know`: https://github.com/ray-project/ray/issues

No Speedup
----------

You just ran an application using Ray, but it wasn't as fast as you expected it
to be. Or worse, perhaps it was slower than the serial version of the
application! The most common reasons are the following.

- **Number of cores:** How many cores is Ray using? When you start Ray, it will
determine the number of CPUs on each machine with ``psutil.cpu_count()``. Ray
usually will not schedule more tasks in parallel than the number of CPUs. So
if the number of CPUs is 4, the most you should expect is a 4x speedup.

- **Physical versus logical CPUs:** Do the machines you're running on have fewer
**physical** cores than **logical** cores? You can check the number of logical
cores with ``psutil.cpu_count()`` and the number of physical cores with
``psutil.cpu_count(logical=False)``. This is common on a lot of machines and
especially on EC2. For many workloads (especially numerical workloads), you
often cannot expect a greater speedup than the number of physical CPUs.

- **Small tasks:** Are your tasks very small? Ray introduces some overhead for
each task (the amount of overhead depends on the arguments that are passed
in). You are unlikely to see speedups if your tasks take less than ten
milliseconds. For many workloads, you can easily increase the size of your
tasks by batching them together.

- **Variable durations:** Do your tasks have variable durations? If you run 10
tasks with variable durations in parallel, you shouldn't expect a 10x
speedup, because you'll end up waiting for the slowest task. In this case,
consider using ``ray.wait`` to begin processing the tasks that finish first
(see the sketch after this list).

- **Multi-threaded libraries:** Are all of your tasks attempting to use all of
the cores on the machine? If so, they are likely to experience contention and
prevent your application from achieving a speedup. You can diagnose this by
opening ``top`` while your application is running. If one process is using
most of the CPUs and the others are using only a small amount, this may be
the problem. This is very common with some versions of ``numpy``, and in that
case it can usually be fixed by setting an environment variable like
``MKL_NUM_THREADS`` (or the equivalent for your installation) to ``1``.
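
As a quick sanity check for some of the points above, here is a minimal
sketch (the task body and the durations are made up for illustration) that
prints the logical and physical core counts with ``psutil`` and uses
``ray.wait`` to process results as they finish:

.. code-block:: python

  import time

  import numpy as np
  import psutil
  import ray

  # Compare logical and physical core counts. Ray uses psutil.cpu_count()
  # to decide how many tasks to run in parallel.
  print(psutil.cpu_count())               # Logical cores.
  print(psutil.cpu_count(logical=False))  # Physical cores.

  ray.init()

  @ray.remote
  def variable_task(duration):
      time.sleep(duration)
      return duration

  # Launch tasks whose durations vary.
  remaining_ids = [variable_task.remote(d)
                   for d in np.random.uniform(0, 1, size=10)]

  # Process whichever result is ready first instead of waiting for the
  # slowest task with a single call to ray.get.
  while remaining_ids:
      ready_ids, remaining_ids = ray.wait(remaining_ids, num_returns=1)
      print(ray.get(ready_ids[0]))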

If you are still experiencing a slowdown, but none of the above problems apply,
we'd really like to know! Please create a `GitHub issue`_ and consider
submitting a minimal code example that demonstrates the problem.

.. _`GitHub issue`: https://github.com/ray-project/ray/issues

Crashes
-------

If Ray crashed, you may wonder what happened. Currently, this can occur for some
of the following reasons.

- **Stressful workloads:** Workloads that create a very large number of tasks
in a short amount of time can sometimes interfere with the heartbeat
mechanism that we use to check that processes are still alive. On the head
node in the cluster, you can check the files
``/tmp/raylogs/monitor-******.out`` and ``/tmp/raylogs/monitor-******.err``.
They will indicate which processes Ray has marked as dead (due to a lack of
heartbeats). However, it is currently possible for a process to get marked as
dead without actually having died.

- **Starting many actors:** Workloads that start a large number of actors all at
once may exhibit problems when the processes (or libraries that they use)
contend for resources. Similarly, a script that starts many actors over the
lifetime of the application will eventually cause the system to run out of
file descriptors. This is addressable, but currently we do not garbage collect
actor processes until the script finishes.

- **Running out of file descriptors:** As a workaround, you may be able to
increase the maximum number of file descriptors with a command like
``ulimit -n 65536``. If that fails, double check that the hard limit is
sufficiently large by running ``ulimit -Hn``. If it is too small, you can
increase the hard limit as follows (these instructions work on EC2). An
in-process Python alternative is also sketched after this list.

* Increase the hard ulimit for open file descriptors system-wide by running
  the following.

  .. code-block:: bash

    sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf"

* Log out and log back in.
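
If you want to check or raise the limit from within a Python process instead,
a sketch along the following lines may help (this uses Python's standard
``resource`` module rather than anything Ray-specific, and it only raises the
soft limit for the current process, up to the existing hard limit):

.. code-block:: python

  import resource

  # Query the current soft and hard limits on open file descriptors.
  soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

  # Raise the soft limit for this process as far as the hard limit allows.
  desired = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
  resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))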


Hanging
-------

If a workload is hanging and not progressing, the problem may be one of the
following.

- **Reconstructing an object created with put:** When an object that is needed
has been evicted or lost, Ray will attempt to rerun the task that created the
object. However, there are some cases that currently are not handled. For
example, if the object was created by a call to ``ray.put`` on the driver
process, then the argument that was passed into ``ray.put`` is no longer
available and so the call to ``ray.put`` cannot be rerun (without rerunning
the driver). This case is contrasted with an ordinary remote function in the
sketch after this list.

- **Reconstructing an object created by an actor task:** Ray currently does
not reconstruct objects created by actor methods.
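
As a rough illustration of the ``ray.put`` case (a minimal sketch; the array
size and the function name are arbitrary), contrast an object placed in the
object store by the driver with one returned by a remote function:

.. code-block:: python

  import numpy as np
  import ray

  ray.init()

  # An object placed in the object store directly by the driver. If it is
  # later evicted or lost, it cannot be recreated by rerunning a task,
  # because no task produced it.
  put_id = ray.put(np.zeros(10 ** 6))

  @ray.remote
  def create_array():
      return np.zeros(10 ** 6)

  # An object produced by an ordinary remote function. If it is evicted or
  # lost, Ray can attempt to reconstruct it by rerunning create_array.
  task_id = create_array.remote()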

Serialization Problems
----------------------

Ray's serialization is currently imperfect for certain kinds of objects. If
you encounter an object that Ray does not serialize or deserialize correctly,
please let us know. For example, you may want to bring it up on
`this thread`_. Known problem cases include the following (a short sketch of
such objects appears below).

- `Objects with multiple references to the same object`_.

- `Subtypes of lists, dictionaries, or tuples`_.

.. _`this thread`: https://github.com/ray-project/ray/issues/557
.. _`Objects with multiple references to the same object`: https://github.com/ray-project/ray/issues/319
.. _`Subtypes of lists, dictionaries, or tuples`: https://github.com/ray-project/ray/issues/512
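
For reference, here is a minimal sketch (the names are arbitrary) of the
kinds of objects described in the issues above:

.. code-block:: python

  # A list containing two references to the same underlying object.
  a = [1, 2, 3]
  contains_shared_references = [a, a]

  # An instance of a subtype of dict. Subtypes of lists and tuples are
  # similar known problem cases.
  class CustomDict(dict):
      pass

  custom = CustomDict(key="value")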
20 changes: 0 additions & 20 deletions doc/source/using-ray-on-a-large-cluster.rst
@@ -294,23 +294,3 @@ node with agent forwarding enabled. This is done as follows.
ssh-add <ssh-key>
ssh -A ubuntu@<head-node-public-ip>
Configuring EC2 instances to increase the number of allowed Redis clients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section can be ignored unless you run into problems with the maximum
number of Redis clients.

* Ensure that the hard limit for the number of open file descriptors is set
to a large number (e.g., 65536). This only needs to be done on instances
where Redis shards will run --- by default, just the head node.

* Check the hard ulimit for open file descriptors with ``ulimit -Hn``.
* If that number is smaller than 65536, set the hard ulimit for open file
descriptors system-wide:

.. code-block:: bash

  sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf"

* Logout and log back in.
