Investigate fire-and-forget mode for async flush

After an async flush has started, an application must make another SCR call to finalize that flush.  Even after the async flush has copied all files, the output set is not valid until it has been finalized.  The calls that finalize async flushes are: ``SCR_Start_output``, ``SCR_Complete_output``, and ``SCR_Finalize``.

If an application using async flush writes checkpoints infrequently, there is a real chance that a failure occurs after all files have been copied but before the flush has been finalized.  In this case, SCR will roll back to an earlier checkpoint when restarting the application, even though the hard work of copying all of the files is already done.

It would be nice to extend ``SCR_Init`` so that SCR can detect an async flush that has finished copying but has not yet been marked as complete.  To do this, we could compare the size of each file against its expected size (if we trust POSIX semantics), or we could have each rank write an additional "done" flag to the file system.  On restart, ``SCR_Init`` could look for these markers and update the status of the checkpoint if it finds that all files had been successfully copied.  Something similar could be added to scavenge.
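As a rough illustration of the file-size variant, the sketch below checks whether every file in a list exists as a regular file with its expected size. It is not SCR code: the `flushed_file` struct and both function names are hypothetical, and it assumes the expected sizes were recorded somewhere that survives the failure. A matching size is only a necessary condition, so this shows the idea rather than a complete validity check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical record for one file of an output set: the destination
 * path of the async flush and the size the file should have when the
 * copy is complete. Not an SCR type. */
typedef struct {
    const char *path;
    off_t expected_size;
} flushed_file;

/* Returns true if the path is a regular file whose size matches the
 * expected size. Any stat() failure counts as "not complete". */
static bool file_looks_complete(const flushed_file *f)
{
    assert(f != NULL);

    struct stat st;
    if (stat(f->path, &st) != 0) {
        return false;
    }
    return S_ISREG(st.st_mode) && st.st_size == f->expected_size;
}

/* Returns true only if every file in the list looks complete.
 * An empty list is trivially complete. */
static bool flush_looks_complete(const flushed_file *files, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (!file_looks_complete(&files[i])) {
            return false;
        }
    }
    return true;
}
```

In a parallel job, each rank would presumably check only its own files, and the results would then need to be combined across ranks before the checkpoint status is updated.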

In the meantime, it could be useful to add checks to calls like ``SCR_Need_checkpoint`` and ``SCR_Should_exit``, which an application may call more frequently.  In that case, one might need to configure how often SCR checks, since polling for completion may be expensive on some systems.  For example, if time steps are short compared to the polling cost, we would not want to poll after every time step.
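One way to limit the polling cost is a minimum time interval between checks. The sketch below is an assumption about how such a throttle might look, not an existing SCR feature: `poll_throttle` and `poll_is_due` are hypothetical names, and the caller supplies the current time in seconds (for example from `MPI_Wtime`).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical throttle state, not an SCR type. */
typedef struct {
    double min_interval_secs; /* minimum time between two polls */
    double last_poll_secs;    /* time of the most recent poll */
    bool has_polled;          /* false until the first poll happens */
} poll_throttle;

/* Returns true if enough time has passed since the last poll (or if no
 * poll has happened yet) and records now_secs as the new last poll
 * time. Returns false otherwise and leaves the state unchanged. */
static bool poll_is_due(poll_throttle *t, double now_secs)
{
    assert(t != NULL);

    if (t->has_polled &&
        (now_secs - t->last_poll_secs) < t->min_interval_secs) {
        return false;
    }
    t->last_poll_secs = now_secs;
    t->has_polled = true;
    return true;
}
```

With a 10 second interval, a call at t = 100 polls, a call at t = 105 is skipped, and a call at t = 110 polls again. If checking for completion involves collective communication, all ranks would need to reach the same decision, for instance by having one rank decide and broadcast the result.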

Investigate fire-and-forget mode for async flush #531
