Investigate fire-and-forget mode for async flush #531

@adammoody

After an async flush has started, an application must make another SCR call to finalize that flush. Even after the async flush has copied all files, the output set is not valid until it has been finalized. The calls that finalize async flushes are: SCR_Start_output, SCR_Complete_output, and SCR_Finalize.

If an application using async flush writes checkpoints infrequently, a failure is likely to occur after all files have been copied but before the flush has been finalized. In this case, SCR rolls back to an earlier checkpoint when the application restarts. This is a shame, since the hard work of copying all of the files is already done.

It would be nice to extend SCR_Init so that SCR can detect an async flush that finished copying its files but was never marked as complete. To do this, we could check the size of each file (if we trust POSIX semantics), or we could have each rank write an additional "done" flag to the file system. On restart, SCR_Init could look for these markers and update the status of the checkpoint if it finds that all files had been successfully copied. Something similar could be added to scavenge.
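A minimal sketch of the two detection options, assuming plain POSIX calls rather than any existing SCR internals. The `flush_file_entry` type, the function names, and the `rank_<N>.done` marker naming are all hypothetical and only meant to illustrate the idea:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical record for one flushed file: its destination path
 * on the parallel file system and the size it should have. */
typedef struct {
  const char* path;
  off_t size;
} flush_file_entry;

/* Option 1: trust POSIX file sizes.
 * Returns 1 if every file exists as a regular file with the
 * expected size, 0 otherwise. */
static int flush_files_look_complete(const flush_file_entry* files, int count)
{
  assert(files != NULL || count == 0);
  for (int i = 0; i < count; i++) {
    struct stat st;
    if (stat(files[i].path, &st) != 0) return 0;
    if (!S_ISREG(st.st_mode)) return 0;
    if (st.st_size != files[i].size) return 0;
  }
  return 1;
}

/* Build the (hypothetical) marker path "<dir>/rank_<rank>.done".
 * Returns 0 on success, -1 if the path does not fit in buf. */
static int done_marker_path(char* buf, size_t len, const char* dir, int rank)
{
  int n = snprintf(buf, len, "%s/rank_%d.done", dir, rank);
  return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Option 2: each rank writes an empty "done" marker after its
 * files have been copied. Returns 0 on success, -1 on error. */
static int write_done_marker(const char* dir, int rank)
{
  char path[4096];
  if (done_marker_path(path, sizeof(path), dir, rank) != 0) return -1;
  FILE* fp = fopen(path, "w");
  if (fp == NULL) return -1;
  return (fclose(fp) == 0) ? 0 : -1;
}

/* On restart: returns 1 only if a marker exists for every rank. */
static int all_done_markers_present(const char* dir, int nranks)
{
  for (int rank = 0; rank < nranks; rank++) {
    char path[4096];
    struct stat st;
    if (done_marker_path(path, sizeof(path), dir, rank) != 0) return 0;
    if (stat(path, &st) != 0) return 0;
  }
  return 1;
}
```

The size check needs no extra files but depends on the file system reporting sizes accurately after a failure. The marker approach costs one small file per rank, and a real implementation would probably need to sync the data files before writing the marker.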

In the meantime, it could be useful to add async flush completion checks to calls like SCR_Need_checkpoint and SCR_Should_exit, which an application may call more frequently. In that case, one might need to configure how often SCR checks, since polling for completion may be expensive on some systems. For example, if time steps are short compared to the polling cost, we would not want to poll after every time step.
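One way to limit the polling cost is a simple time-based throttle. The sketch below is an assumption about how that could look, not existing SCR code. The `poll_throttle` type and its functions are hypothetical, and the caller supplies the current time (for example from `MPI_Wtime()`):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical throttle state: poll at most once per interval. */
typedef struct {
  double min_interval_secs; /* minimum time between polls */
  double last_poll_secs;    /* time of the last poll, negative if none yet */
} poll_throttle;

static void poll_throttle_init(poll_throttle* t, double min_interval_secs)
{
  assert(t != NULL && min_interval_secs >= 0.0);
  t->min_interval_secs = min_interval_secs;
  t->last_poll_secs = -1.0;
}

/* Returns 1 and records now_secs if enough time has passed since
 * the last poll (or if no poll has happened yet), 0 otherwise. */
static int poll_throttle_should_poll(poll_throttle* t, double now_secs)
{
  assert(t != NULL);
  if (t->last_poll_secs < 0.0 ||
      now_secs - t->last_poll_secs >= t->min_interval_secs) {
    t->last_poll_secs = now_secs;
    return 1;
  }
  return 0;
}
```

With an interval of 10 seconds, calls at t = 0, 5, and 10 would poll, skip, and poll. Because testing for flush completion is likely a collective operation, all ranks would need to reach the same decision, for instance by having one rank evaluate the throttle and broadcast the result.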
