App retries #149

Closed
@azzaea

Description

A Swift/T app may fail nondeterministically. With the TURBINE_APP_RETRIES directive, a failed app is retried on the same host and MPI rank.
However, it may be useful to retry the app on a different rank, for example when one of the hosts is unavailable or there is a network issue. This pull request retries a failed app on a different MPI rank.
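
The retry policy described above can be illustrated with a minimal Python sketch. It is not the Turbine implementation: the function name `run_with_rank_retries`, the `run_on_rank` callback, and the round-robin choice of the next rank are all assumptions made for illustration, as is the reading that the retry count comes on top of the initial attempt.

```python
from typing import Callable, List, Optional, Tuple


def run_with_rank_retries(
    run_on_rank: Callable[[int], bool],
    ranks: List[int],
    first_rank: int,
    max_retries: int,
) -> Tuple[Optional[int], List[int]]:
    """Run an app, moving to another rank after each failure.

    run_on_rank(rank) is a hypothetical callback that returns True
    when the app succeeds on that rank. max_retries plays the role
    of TURBINE_APP_RETRIES (assumed: retries after the first attempt).
    Returns (rank that succeeded or None, list of ranks tried).
    """
    tried: List[int] = []
    start = ranks.index(first_rank)
    for attempt in range(max_retries + 1):
        # Assumed policy: walk the rank list round-robin from the first rank.
        rank = ranks[(start + attempt) % len(ranks)]
        tried.append(rank)
        if run_on_rank(rank):
            return rank, tried
    # All attempts failed; the caller decides how to report the error.
    return None, tried
```

For example, `run_with_rank_retries(lambda r: r == 2, [0, 1, 2, 3], 0, 3)` tries ranks 0 and 1, succeeds on rank 2, and stops there.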

A simple test is added in which an external app (an infinite loop) runs in the background and Turbine attempts to kill it twice. The first kill succeeds. The second either fails outright (if no retries are allowed) or is retried TURBINE_APP_RETRIES times, on different ranks, and then quits with a failure.

Another simple test creates a file and then attempts to delete it twice. Creation and the first deletion both succeed. The second deletion is retried on different MPI ranks until TURBINE_APP_RETRIES attempts have been made, at which point the script exits.
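
The expected behaviour of the file test can be mimicked locally with a short Python sketch. It only models the sequence of outcomes (create, delete, then a second delete that exhausts its retries); it does not involve MPI ranks or Turbine, and the helper names `delete_with_retries` and `demo` are hypothetical.

```python
import os
import tempfile
from typing import Tuple


def delete_with_retries(path: str, max_retries: int) -> Tuple[bool, int]:
    """Try to delete path, retrying up to max_retries extra times.

    Returns (succeeded, number of attempts made). In the real test each
    retry would run on a different MPI rank; here every attempt is local.
    """
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        try:
            os.remove(path)
            return True, attempts
        except FileNotFoundError:
            # The file is already gone, so every retry fails the same way.
            continue
    return False, attempts


def demo(max_retries: int = 2):
    """Create a file, then delete it twice, as in the test described above."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    first = delete_with_retries(path, max_retries)   # file exists
    second = delete_with_retries(path, max_retries)  # file already deleted
    return first, second
```

With `max_retries=2`, the first deletion succeeds on its first attempt, and the second fails after three attempts (the initial one plus two retries).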
