@@ -120,12 +120,9 @@ runtime).
120120Optionally, you can specify ``--with-ft ulfm`` to ensure that ULFM support
121121is definitely built.
122122
123- Support notes
124- ^^^^^^^^^^^^^
125-
126- * ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
127- that if you are going to use ULFM, you should disable building
128- OpenSHMEM with ``--disable-oshmem``.
123+ .. note:: ULFM Fault Tolerance does not apply to OpenSHMEM. It is recommended
124+ that if you are going to use ULFM, you should disable building OpenSHMEM
125+ with ``--disable-oshmem``.
129126
130127Running ULFM Open MPI
131128---------------------
@@ -151,30 +148,33 @@ the normal Open MPI ``mpirun`` launcher, with the
151148
152149 shell$ mpirun --with-ft ulfm ...
153150
154- .. important:: By default, fault tolerance is not active at run time.
155- It must be enabled via `--with-ft ulfm`.
151+ .. important:: By default, fault tolerance is not active at run time.
152+ It must be enabled via ``--with-ft ulfm``.
156153
157154Running under a batch scheduler
158155^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
159156
160157ULFM can operate under a job/batch scheduler, and is tested routinely
161158with ALPS, PBS, and Slurm. One difficulty comes from the fact that
162- many job schedulers will "cleanup" the application as soon as any
163- process fails. In order to avoid this problem, it is preferred that
164- you use ``mpirun`` within an allocation (e.g., ``salloc``,
165- ``sbatch``, ``qsub``) rather than a direct launch (e.g., ``srun``).
159+ many job schedulers handle failures by triggering an immediate "cleanup"
160+ of the application as soon as any process fails. In addition, failure
161+ detection subsystems integrated into PRTE are not active in direct launch
162+ scenarios and may not have a launcher-specific alternative. This may cause
163+ the application to not detect failures and hang. In order to avoid these
164+ problems, it is preferred that you use ``mpirun`` within an allocation
165+ (e.g., ``salloc``, ``sbatch``, ``qsub``) rather than a direct launch.
166166
167167* Slurm is tested and supported with fault tolerance.
168168
169- .. important:: Do not use ``srun``, or your application gets killed
170- by the scheduler upon the first failure. Instead,
171- use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.
172-
173- * LSF is untested with fault tolerance.
169+ .. important:: Use ``mpirun`` in an ``salloc`` or ``sbatch`` allocation.
170+ Direct launch with ``srun`` is not supported.
174171
175172* PBS/Torque is tested and supported with fault tolerance.
176173
177- .. important:: Be sure to use ``mpirun`` in a ``qsub`` allocation.
174+ .. important:: Use ``mpirun`` in a ``qsub`` allocation. Direct launch
175+ with ``aprun`` is not supported.
176+
177+ * LSF is untested with fault tolerance.
178178
179179Run-time tuning knobs
180180^^^^^^^^^^^^^^^^^^^^^
@@ -185,13 +185,12 @@ most cases. You can change the default settings with ``--mca
185185mpi_ft_foo <value>`` for Open MPI options, and with ``--prtemca
186186errmgr_detector_bar <value>`` for PRTE options.
187187
188- .. important:: The main control for enabling/disabling fault tolerance
189- at runtime is the ``--with-ft ulfm`` (or its synonym
190- ``--with-ft mpi``) ``mpirun`` CLI option. This option
191- sets up multiple subsystems in Open MPI to enable fault
192- tolerance. The options described below are best used to
193- override the default behavior after the ``--with-ft ulfm``
194- option is used.
188+ .. important:: The main control for enabling/disabling fault tolerance
189+ at runtime is the ``--with-ft ulfm`` (or its synonym ``--with-ft mpi``)
190+ ``mpirun`` CLI option. This option sets up multiple subsystems in
191+ Open MPI to enable fault tolerance. The options described below are
192+ best used to override the default behavior after the ``--with-ft ulfm``
193+ option is used.
195194
196195PRTE level options
197196~~~~~~~~~~~~~~~~~~
@@ -225,12 +224,11 @@ runtime behavior of ULFM by either setting or unsetting variables in
225224this file, or by overriding the variable on the command line (e.g.,
226225``--mca btl ofi,self``).
227226
228- .. important:: Note that if fault tolerance is disabled at runtime,
229- (that is, when not using ``--with-ft ulfm``), the
230- ``ft-mpi`` AMCA param file is not loaded, thus
231- components that are unsafe for fault tolerance will
232- load normally (this may change observed performance
233- when comparing with and without fault tolerance).
227+ .. important:: Note that if fault tolerance is disabled at runtime,
228+ (that is, when not using ``--with-ft ulfm``), the ``ft-mpi`` AMCA
229+ param file is not loaded, thus components that are unsafe for fault
230+ tolerance will load normally (this may change observed performance
231+ when comparing with and without fault tolerance).
234232
235233* ``mpi_ft_enable <true|false> (default: false)``
236234 permits turning on/off fault tolerance at runtime. This option is
@@ -254,23 +252,33 @@ this file, or by overiding the variable on the command line (e.g.,
254252 ``MPI_COMM_WORLD`` exclusively. Processes connected from
255253 ``MPI_COMM_CONNECT``/``ACCEPT`` and ``MPI_COMM_SPAWN`` may
256254 occasionally not be detected when they fail.
255+
256+ .. caution:: This component is deprecated. Failure detection is now
257+ performed at the PRTE level. See the section above on controlling
258+ PRTE behavior for information about how to tune the failure detector.
259+
257260* ``mpi_ft_detector_thread <true|false> (default: false)`` controls
258261 the use of a thread to emit and receive the failure detector's
259262 heartbeats. *Setting this value to "true" will also set
260263 MPI_THREAD_MULTIPLE support, which has a noticeable effect on
261264 latency (typically 1us increase).* You may want to **enable this
262265 option if you experience false positive** processes incorrectly
263266 reported as failed with the Open MPI failure detector.
264- This option is only relevant when ``mpi_ft_detector`` is ``true``.
267+
268+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
269+
265270* ``mpi_ft_detector_period <float> (default: 3e0 seconds)`` heartbeat
266271 period. Recommended value is 1/3 of the timeout. *Values lower than
267272 100us may impart a noticeable effect on latency (typically a 3us
268273 increase).*
269- This option is only relevant when ``mpi_ft_detector`` is ``true``.
274+
275+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
276+
270277* ``mpi_ft_detector_timeout <float> (default: 1e1 seconds)`` heartbeat
271278 timeout (i.e. failure detection speed). Recommended value is 3 times
272279 the heartbeat period.
273- This option is only relevant when ``mpi_ft_detector`` is ``true``.
280+
281+ .. important:: This option is only relevant when ``mpi_ft_detector`` is ``true``.
274282
275283Known Limitations in ULFM
276284-------------------------
@@ -287,27 +295,27 @@ Frameworks and components are listed below and categorized into one of
287295three classifications:
288296
2892971. **Modified:** This framework/component has been specifically modified
290- such that it will continue to work after a failure.
298+ such that it will continue to work after a failure.
2912992. **Untested:** This framework/component has not been modified and/or
292- tested with fault tolerance scenarios, and *may* malfunction
293- after a failure.
300+ tested with fault tolerance scenarios, and *may* malfunction
301+ after a failure.
2943023. **Disabled:** This framework/component will cause unspecified behavior when
295- fault tolerance is enabled.
303+ fault tolerance is enabled.
296304
297305Any framework or component not listed below is categorized as **Unmodified**,
298306meaning that it is unmodified for fault tolerance, but will continue to work
299307correctly after a failure.
300308
301309* ``pml``: MPI point-to-point management layer
302310
303- * ``monitoring``, ``v``: **untested** (they have not been modified
304- to handle faults)
311+ * ``monitoring``, ``v``: **untested** (they have not been modified to handle
312+ faults)
305313 * ``cm``, ``crcpw``, ``ucx``: **disabled**
306314
307315* ``btl``: Point-to-point Byte Transfer Layer
308316
309- * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``:
310- **untested** (they may work properly, please report)
317+ * ``ofi``, ``portals4``, ``smcuda``, ``usnic``, ``sm(+knem)``: **untested**
318+ (they may work properly, please report)
311319
312320* ``mtl``: Matching transport layer. Used for MPI point-to-point messages on
313321 some types of networks