Commit

Add section on troubleshooting to the documentation. (ray-project#583)
* Add section on troubleshooting to the documentation.

* Address comments.

* Update file descriptor troubleshooting.
robertnishihara authored and pcmoritz committed May 22, 2017
1 parent 06241da commit bc8b0db
Showing 3 changed files with 128 additions and 20 deletions.
6 changes: 6 additions & 0 deletions doc/source/index.rst
@@ -47,3 +47,9 @@ Ray
using-ray-on-a-cluster.rst
using-ray-on-a-large-cluster.rst
using-ray-and-docker-on-a-cluster.md

.. toctree::
:maxdepth: 1
:caption: Help

troubleshooting.rst
122 changes: 122 additions & 0 deletions doc/source/troubleshooting.rst
@@ -0,0 +1,122 @@
Troubleshooting
===============

This document discusses some common problems that people run into when using Ray
as well as some known problems. If you encounter other problems, please
`let us know`_.

.. _`let us know`: https://github.com/ray-project/ray/issues

No Speedup
----------

You just ran an application using Ray, but it wasn't as fast as you expected it
to be. Or worse, perhaps it was slower than the serial version of the
application! The most common reasons are the following.

- **Number of cores:** How many cores is Ray using? When you start Ray, it will
determine the number of CPUs on each machine with ``psutil.cpu_count()``. Ray
usually will not schedule more tasks in parallel than the number of CPUs. So
if the number of CPUs is 4, the most you should expect is a 4x speedup.

- **Physical versus logical CPUs:** Do the machines you're running on have fewer
**physical** cores than **logical** cores? You can check the number of logical
cores with ``psutil.cpu_count()`` and the number of physical cores with
``psutil.cpu_count(logical=False)``. This is common on a lot of machines and
especially on EC2. For many workloads (especially numerical workloads), you
often cannot expect a greater speedup than the number of physical CPUs.

- **Small tasks:** Are your tasks very small? Ray introduces some overhead for
each task (the amount of overhead depends on the arguments that are passed
in). You are unlikely to see speedups if your tasks take less than ten
milliseconds. For many workloads, you can easily increase the size of your
tasks by batching them together.

- **Variable durations:** Do your tasks have variable durations? If you run 10
tasks with variable durations in parallel, you shouldn't expect a 10x
speedup, because you'll end up waiting for the slowest task. In this case,
consider using ``ray.wait`` to begin processing the tasks that finish first
(see the sketch after this list).

- **Multi-threaded libraries:** Are all of your tasks attempting to use all of
the cores on the machine? If so, they are likely to experience contention and
prevent your application from achieving a speedup. You can diagnose this by
opening ``top`` while your application is running. If one process is using
most of the CPUs and the others are using only a small amount, this may be
the problem. This is very common with some versions of ``numpy``, and in that
case it can usually be fixed by setting an environment variable like
``MKL_NUM_THREADS`` (or the equivalent for your installation) to ``1``.
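
As a quick sanity check for some of the points above, here is a minimal
sketch (the task body and the durations are made up for illustration) that
prints the logical and physical core counts with ``psutil`` and uses
``ray.wait`` to process results as they finish:

.. code-block:: python

  import time

  import numpy as np
  import psutil
  import ray

  # Compare logical and physical core counts. Ray uses psutil.cpu_count()
  # to decide how many tasks to run in parallel.
  print(psutil.cpu_count())               # Logical cores.
  print(psutil.cpu_count(logical=False))  # Physical cores.

  ray.init()

  @ray.remote
  def variable_task(duration):
      time.sleep(duration)
      return duration

  # Launch tasks whose durations vary.
  remaining_ids = [variable_task.remote(d)
                   for d in np.random.uniform(0, 1, size=10)]

  # Process whichever result is ready first instead of waiting for the
  # slowest task with a single call to ray.get.
  while remaining_ids:
      ready_ids, remaining_ids = ray.wait(remaining_ids, num_returns=1)
      print(ray.get(ready_ids[0]))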

If you are still experiencing a slowdown, but none of the above problems apply,
we'd really like to know! Please create a `GitHub issue`_ and consider
submitting a minimal code example that demonstrates the problem.

.. _`GitHub issue`: https://github.com/ray-project/ray/issues

Crashes
-------

If Ray crashed, you may wonder what happened. Currently, this can occur for some
of the following reasons.

- **Stressful workloads:** Workloads that create a very large number of tasks
in a short amount of time can sometimes interfere with the heartbeat
mechanism that we use to check that processes are still alive. On the head
node in the cluster, you can check the files
``/tmp/raylogs/monitor-******.out`` and ``/tmp/raylogs/monitor-******.err``.
They will indicate which processes Ray has marked as dead (due to a lack of
heartbeats). However, it is currently possible for a process to get marked as
dead without actually having died.

- **Starting many actors:** Workloads that start a large number of actors all at
once may exhibit problems when the processes (or libraries that they use)
contend for resources. Similarly, a script that starts many actors over the
lifetime of the application will eventually cause the system to run out of
file descriptors. This is addressable, but currently we do not garbage collect
actor processes until the script finishes.

- **Running out of file descriptors:** As a workaround, you may be able to
increase the maximum number of file descriptors with a command like
``ulimit -n 65536``. If that fails, double check that the hard limit is
sufficiently large by running ``ulimit -Hn``. If it is too small, you can
increase the hard limit as follows (these instructions work on EC2). An
in-process Python alternative is also sketched after this list.

* Increase the hard ulimit for open file descriptors system-wide by running
  the following.

  .. code-block:: bash

    sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf"

* Log out and log back in.
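
If you want to check or raise the limit from within a Python process instead,
a sketch along the following lines may help (this uses Python's standard
``resource`` module rather than anything Ray-specific, and it only raises the
soft limit for the current process, up to the existing hard limit):

.. code-block:: python

  import resource

  # Query the current soft and hard limits on open file descriptors.
  soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

  # Raise the soft limit for this process as far as the hard limit allows.
  desired = 65536 if hard == resource.RLIM_INFINITY else min(65536, hard)
  resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))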


Hanging
-------

If a workload is hanging and not progressing, the problem may be one of the
following.

- **Reconstructing an object created with put:** When an object that is needed
has been evicted or lost, Ray will attempt to rerun the task that created the
object. However, there are some cases that currently are not handled. For
example, if the object was created by a call to ``ray.put`` on the driver
process, then the argument that was passed into ``ray.put`` is no longer
available and so the call to ``ray.put`` cannot be rerun (without rerunning
the driver). This case is contrasted with an ordinary remote function in the
sketch after this list.

- **Reconstructing an object created by an actor task:** Ray currently does
not reconstruct objects created by actor methods.
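
As a rough illustration of the ``ray.put`` case (a minimal sketch; the array
size and the function name are arbitrary), contrast an object placed in the
object store by the driver with one returned by a remote function:

.. code-block:: python

  import numpy as np
  import ray

  ray.init()

  # An object placed in the object store directly by the driver. If it is
  # later evicted or lost, it cannot be recreated by rerunning a task,
  # because no task produced it.
  put_id = ray.put(np.zeros(10 ** 6))

  @ray.remote
  def create_array():
      return np.zeros(10 ** 6)

  # An object produced by an ordinary remote function. If it is evicted or
  # lost, Ray can attempt to reconstruct it by rerunning create_array.
  task_id = create_array.remote()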

Serialization Problems
----------------------

Ray's serialization is currently imperfect for certain kinds of objects. If
you encounter an object that Ray does not serialize or deserialize correctly,
please let us know. For example, you may want to bring it up on
`this thread`_. Known problem cases include the following (a short sketch of
such objects appears below).

- `Objects with multiple references to the same object`_.

- `Subtypes of lists, dictionaries, or tuples`_.

.. _`this thread`: https://github.com/ray-project/ray/issues/557
.. _`Objects with multiple references to the same object`: https://github.com/ray-project/ray/issues/319
.. _`Subtypes of lists, dictionaries, or tuples`: https://github.com/ray-project/ray/issues/512
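
For reference, here is a minimal sketch (the names are arbitrary) of the
kinds of objects described in the issues above:

.. code-block:: python

  # A list containing two references to the same underlying object.
  a = [1, 2, 3]
  contains_shared_references = [a, a]

  # An instance of a subtype of dict. Subtypes of lists and tuples are
  # similar known problem cases.
  class CustomDict(dict):
      pass

  custom = CustomDict(key="value")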
20 changes: 0 additions & 20 deletions doc/source/using-ray-on-a-large-cluster.rst
@@ -294,23 +294,3 @@ node with agent forwarding enabled. This is done as follows.
ssh-add <ssh-key>
ssh -A ubuntu@<head-node-public-ip>
Configuring EC2 instances to increase the number of allowed Redis clients
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section can be ignored unless you run into problems with the maximum
number of Redis clients.

* Ensure that the hard limit for the number of open file descriptors is set
to a large number (e.g., 65536). This only needs to be done on instances
where Redis shards will run --- by default, just the head node.

* Check the hard ulimit for open file descriptors with ``ulimit -Hn``.
* If that number is smaller than 65536, set the hard ulimit for open file
descriptors system-wide:

.. code-block:: bash

  sudo bash -c "echo $USER hard nofile 65536 >> /etc/security/limits.conf"

* Logout and log back in.
