Skip to content

Commit

Permalink
Initial fault tolerance documentation. (ray-project#845)
Browse files Browse the repository at this point in the history
* Initial fault tolerance documentation.

* Update documentation.
  • Loading branch information
robertnishihara authored and pcmoritz committed Aug 20, 2017
1 parent ea8da13 commit af71f96
Show file tree
Hide file tree
Showing 2 changed files with 61 additions and 0 deletions.
60 changes: 60 additions & 0 deletions doc/source/fault-tolerance.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,60 @@
Fault Tolerance
===============

This document describes the handling of failures in Ray.

Machine and Process Failures
----------------------------

Currently, each **local scheduler** and each **plasma manager** send heartbeats
to a **monitor** process. If the monitor does not receive any heartbeats from a
given process for some duration of time (about ten seconds), then it will mark
that process as dead. The monitor process will then clean up the associated
state in the Redis servers. If a manager is marked as dead, the object table
will be updated to remove all occurrences of that manager so that other managers
don't try to fetch objects from the dead manager. If a local scheduler is marked
as dead, all of the tasks that are marked as executing on that local scheduler
in the task table will be marked as lost and all actors associated with that
local scheduler will be recreated by other local schedulers.

Lost Objects
------------

If an object is needed but is lost or was never created, then the task that
created the object will be re-executed to create the object. If necessary, tasks
needed to create the input arguments to the task being re-executed will also be
re-executed.

Actors
------

When a local scheduler is marked as dead, all actors associated with that local
scheduler that were still alive will be recreated by other local schedulers. By
default, all of the actor methods will be re-executed in the same order that
they were initially executed. If actor checkpointing is enabled, then the actor
state will be loaded from the most recent checkpoint and the actor methods that
occurred after the checkpoint will be re-executed. Note that actor checkpointing
is currently an experimental feature.


Unhandled Failures
------------------

At the moment, Ray does not handle all failure scenarios. We are working on
addressing these problems.

Process Failures
~~~~~~~~~~~~~~~~

1. Ray does not recover from the failure of any of the following processes:
a Redis server, the global scheduler, the monitor process.
2. If a driver fails, that driver will not be restarted and the job will not
complete.

Lost Objects
~~~~~~~~~~~~

1. If an object is constructed by a call to ``ray.put`` on the driver, is then
evicted, and is later needed, Ray will not reconstruct this object.
2. If an object is constructed by an actor method, is then evicted, and is later
needed, Ray will not reconstruct this object.
1 change: 1 addition & 0 deletions doc/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ Ray

internals-overview.rst
serialization.rst
fault-tolerance.rst

.. toctree::
:maxdepth: 1
Expand Down

0 comments on commit af71f96

Please sign in to comment.