Initial fault tolerance documentation. (ray-project#845)

* Initial fault tolerance documentation. * Update documentation.
hamidmaei · Aug 20, 2017 · af71f96 · af71f96
1 parent ea8da13
commit af71f96
Show file tree

Hide file tree

Showing 2 changed files with 61 additions and 0 deletions.
diff --git a/doc/source/fault-tolerance.rst b/doc/source/fault-tolerance.rst
@@ -0,0 +1,60 @@
+Fault Tolerance
+===============
+
+This document describes the handling of failures in Ray.
+
+Machine and Process Failures
+----------------------------
+
+Currently, each **local scheduler** and each **plasma manager** send heartbeats
+to a **monitor** process. If the monitor does not receive any heartbeats from a
+given process for some duration of time (about ten seconds), then it will mark
+that process as dead. The monitor process will then clean up the associated
+state in the Redis servers. If a manager is marked as dead, the object table
+will be updated to remove all occurrences of that manager so that other managers
+don't try to fetch objects from the dead manager. If a local scheduler is marked
+as dead, all of the tasks that are marked as executing on that local scheduler
+in the task table will be marked as lost and all actors associated with that
+local scheduler will be recreated by other local schedulers.
+
+Lost Objects
+------------
+
+If an object is needed but is lost or was never created, then the task that
+created the object will be re-executed to create the object. If necessary, tasks
+needed to create the input arguments to the task being re-executed will also be
+re-executed.
+
+Actors
+------
+
+When a local scheduler is marked as dead, all actors associated with that local
+scheduler that were still alive will be recreated by other local schedulers. By
+default, all of the actor methods will be re-executed in the same order that
+they were initially executed. If actor checkpointing is enabled, then the actor
+state will be loaded from the most recent checkpoint and the actor methods that
+occurred after the checkpoint will be re-executed. Note that actor checkpointing
+is currently an experimental feature.
+
+
+Unhandled Failures
+------------------
+
+At the moment, Ray does not handle all failure scenarios. We are working on
+addressing these problems.
+
+Process Failures
+~~~~~~~~~~~~~~~~
+
+1. Ray does not recover from the failure of any of the following processes:
+   a Redis server, the global scheduler, the monitor process.
+2. If a driver fails, that driver will not be restarted and the job will not
+   complete.
+
+Lost Objects
+~~~~~~~~~~~~
+
+1. If an object is constructed by a call to ``ray.put`` on the driver, is then
+   evicted, and is later needed, Ray will not reconstruct this object.
+2. If an object is constructed by an actor method, is then evicted, and is later
+   needed, Ray will not reconstruct this object.
diff --git a/doc/source/index.rst b/doc/source/index.rst
@@ -40,6 +40,7 @@ Ray
 
    internals-overview.rst
    serialization.rst
+   fault-tolerance.rst
 
 .. toctree::
    :maxdepth: 1