@@ -6,6 +6,17 @@ User-Level Fault Mitigation (ULFM)
66This chapter documents the features and options specific to the **User
77Level Failure Mitigation (ULFM)** Open MPI implementation.
88
9+ TL;DR
10+ -----
11+ This is an extremely terse summary of how to use ULFM:
12+
13+ .. code-block::
14+
15+    ./configure --with-ft=ulfm [...options...]
16+    make [-j N] all install
17+    mpicc my-ft-program.c -o my-ft-program
18+    mpiexec -n 4 --with-ft ulfm my-ft-program
19+
920 Features
1021--------
1122
@@ -100,11 +111,11 @@ Available from: https://journals.sagepub.com/doi/10.1177/1094342013488238.
100111Building ULFM support in Open MPI
101112---------------------------------
102113
103- In Open MPI |ompi_ver|, ULFM support is **enabled by default** |mdash|
104- when you build Open MPI, unless you specify ``--without-ft``, ULFM
114+ In Open MPI |ompi_ver|, ULFM support is **built-in by default** |mdash|
115+ that is, when you build Open MPI, unless you specify ``--without-ft``, ULFM
105116support will automatically be built.
106117
107- Optionally, you can specify ``--with-ft`` to ensure that ULFM support
118+ Optionally, you can specify ``--with-ft=ulfm`` to ensure that ULFM support
108119is definitely built.
109120
110121Support notes
@@ -215,7 +226,7 @@ Running your application
215226
216227You can launch your application with fault tolerance by simply using
217228the normal Open MPI ``mpiexec`` launcher, with the
218- ``--with-ft ulfm`` CLI option:
229+ ``--with-ft ulfm`` CLI option (or its synonym ``--with-ft mpi``):
219230
220231.. code-block::
221232
@@ -234,6 +245,11 @@ you use ``mpiexec`` within an allocation (e.g., ``salloc``,
234245Run-time tuning knobs
235246^^^^^^^^^^^^^^^^^^^^^
236247
248+ The main control for enabling/disabling fault tolerance at runtime
249+ is the ``--with-ft ulfm`` (or its synonym ``--with-ft mpi``) ``mpiexec``
250+ CLI option. This option sets up multiple subsystems of Open MPI
251+ to enable fault tolerance.
252+
237253ULFM comes with a variety of knobs for controlling how it runs. The
238254default parameters are sane and should result in good performance in
239255most cases. You can change the default settings with ``--mca
@@ -243,9 +259,10 @@ errmgr_detector_bar <value>`` for PRTE options.
243259PRTE level options
244260~~~~~~~~~~~~~~~~~~
245261
246- * ``prrte_enable_recovery <true|false> (default: false)`` controls
262+ * ``prrte_enable_ft <true|false> (default: false)`` controls
247263 automatic cleanup of apps with failed processes within
248- mpirun. Enabling this option also enables ``mpi_ft_enable``.
264+ mpirun. This option is automatically set to ``true`` when using
265+ ``--with-ft ulfm``.
249266* ``errmgr_detector_priority <int> (default: 1005)`` selects the
250267 PRRTE-based failure detector. Only available when
251268 ``prte_enable_recovery`` is ``true``. You can set this to ``0`` when
@@ -263,17 +280,29 @@ PRTE level options
263280Open MPI level options
264281~~~~~~~~~~~~~~~~~~~~~~
265282
266- * ``mpi_ft_enable <true|false> (default: same as
267- prrte_enable_recovery)`` permits turning on/off fault tolerance at
268- runtime. When false, failure detection is disabled; Interfaces
269- defined by the fault tolerance extensions are substituted with dummy
270- non-fault tolerant implementations (e.g., ``MPIX_Comm_agree`` is
271- implemented with ``MPI_Allreduce``); All other controls below become
272- irrelevant.
283+ Some default values are applied to some Open MPI parameters when using
284+ ``mpiexec --with-ft ulfm``. These defaults are obtained from the ``ft-mpi``
285+ aggregate MCA param file
286+ ``$installdir/share/openmpi/amca-param-sets/ft-mpi``. You can tune the
287+ runtime behavior with ULFM by either setting or unsetting variables in
288+ this file, or by overriding the variable on the command line (e.g.,
289+ ``--mca btl ofi,self``). Note that if fault tolerance is not enabled at
290+ runtime (that is, when not using ``--with-ft ulfm``), this param file is
291+ not loaded, which may change which components are selected (this in turn
292+ may change observed performance when comparing with and without fault
293+ tolerance).
294+
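As a rough illustration of the override path described above, the following shell sketch builds the path of the ``ft-mpi`` parameter file and assembles a launch line that overrides one value. The ``/opt/openmpi`` prefix and the ``ft_param_file`` helper are assumptions made for this example, not part of Open MPI, and the launch command is only printed, not executed:

```shell
# Hypothetical helper: build the path of the ft-mpi aggregate MCA param file
# for a given installation prefix (the prefix is an assumption).
ft_param_file() {
    printf '%s/share/openmpi/amca-param-sets/ft-mpi\n' "$1"
}

param_file=$(ft_param_file /opt/openmpi)

# Show the defaults that --with-ft ulfm would load, if the file is readable.
if [ -r "$param_file" ]; then
    grep -v '^#' "$param_file" || true
fi

# A value given with --mca on the command line overrides the file's default.
launch_cmd="mpiexec -n 4 --with-ft ulfm --mca btl ofi,self my-ft-program"
echo "$launch_cmd"
```

Editing the file changes the default for every fault-tolerant run from that installation, while the command-line form affects a single launch.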
295+ * ``mpi_ft_enable <true|false> (default: false)``
296+ permits turning on/off fault tolerance at runtime. This option is
297+ automatically set to ``true`` from the aggregate MCA param file
298+ ``ft-mpi`` loaded when using ``--with-ft ulfm``. When false, failure
299+ detection is disabled; interfaces defined by the fault tolerance extensions
300+ are substituted with dummy non-fault tolerant implementations (e.g.,
301+ ``MPIX_Comm_agree`` is implemented with ``MPI_Allreduce``); and all other
302+ controls below become irrelevant.
273303* ``mpi_ft_verbose <int> (default: 0)`` increases the output of the
274304 fault tolerance activities. A value of 1 will report detected
275- failures.
276- * ``mpi_ft_detector <true|false> (default: false)``, **EXPERIMENTAL**
305+ failures.
+ * ``mpi_ft_detector <true|false> (default: false)``, **DEPRECATED**
277306 controls the activation of the Open MPI level failure detector. When
278307 this detector is turned off, all failure detection is delegated to
279308 PRTE (see above). The Open MPI level fault detector is
@@ -291,13 +320,16 @@ Open MPI level options
291320 latency (typically 1us increase). You may want to **enable this
292321 option if you experience false positive** processes incorrectly
293322 reported as failed with the Open MPI failure detector.
323+ This option is only relevant when ``mpi_ft_detector`` is ``true``.
294324* ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
295325 period. Recommended value is 1/3 of the timeout. *Values lower than
296326 100us may impart a noticeable effect on latency (typically a 3us
297327 increase).*
328+ This option is only relevant when ``mpi_ft_detector`` is ``true``.
298329* ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
299330 timeout (i.e. failure detection speed). Recommended value is 3 times
300331 the heartbeat period.
332+ This option is only relevant when ``mpi_ft_detector`` is ``true``.
301333
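The recommended ratio between the two heartbeat settings can be sketched in shell. The ``detector_period`` helper and the 9 second timeout are assumptions chosen for illustration; the parameter names are the ones listed above, and the launch command is only printed, not run:

```shell
# Hypothetical helper: the heartbeat period is one third of the timeout.
detector_period() {
    awk -v t="$1" 'BEGIN { printf "%g", t / 3 }'
}

timeout_s=9
period_s=$(detector_period "$timeout_s")

# These knobs only matter when the Open MPI level detector is turned on.
detector_args="--mca mpi_ft_detector true"
detector_args="$detector_args --mca mpi_ft_detector_timeout $timeout_s"
detector_args="$detector_args --mca mpi_ft_detector_period $period_s"

echo "mpiexec -n 4 --with-ft ulfm $detector_args my-ft-program"
```

Deriving the period from the timeout keeps the two values consistent when only one of them is tuned.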
302334Known Limitations in ULFM
303335^^^^^^^^^^^^^^^^^^^^^^^^^