
NFS caching between compute node and node running job script leads to inconsistent behavior #337

@adammoody

Description

The library (via rank 0) writes several files from a compute node that the SCR run scripts later read from the node running the job script. When these files live on NFS, client-side caching can lead to inconsistent behavior. For example, consider the following sequence of commands executed in a job script:

jsrun -r 1 ./test_api (rank 0 writes to .scr/halt.scr from a compute node)
rm -f .scr/halt.scr
scr_halt --list `pwd` (attempts to read .scr/halt.scr)

When SCR_Finalize() is called, rank 0 of test_api writes an SCR_FINALIZE_CALLED entry to the .scr/halt.scr file from the compute node where it runs. The subsequent rm command should remove the halt file after the run completes, so the scr_halt command that follows should not find it.

However, on some systems scr_halt still finds .scr/halt.scr in the state that rank 0 left it. My best guess is that the NFS client on rank 0's compute node flushes its cached writes to the NFS server only after the rm command, executed on the node running the job script, has already deleted the file, so the halt file effectively reappears.
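
One way to check this from the node running the job script is to poll for the file after deleting it; if a delayed write-back is to blame, the file reappears within the NFS cache window. This is a diagnostic sketch only, assuming a bash job script (SECONDS is a bash builtin), and the 120-second bound is arbitrary:

rm -f .scr/halt.scr
# Watch for a delayed NFS write-back recreating the deleted file.
end=$((SECONDS + 120))
while [ "$SECONDS" -lt "$end" ]; do
    if [ -e .scr/halt.scr ]; then
        echo "halt file reappeared"
        break
    fi
    sleep 1
done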

One can often work around this problem by adding a sleep, e.g.,

jsrun -r 1 ./test_api
sleep 60
rm -f .scr/halt.scr
scr_halt --list `pwd`

The sleep must be long enough for the NFS cache timeout to expire on the compute node where rank 0 ran.
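
Since the required delay depends on the NFS mount options, a fixed sleep is fragile. A more robust variant, sketched below under the same bash assumption, retries the delete until the halt file stays gone; the 120-second bound and 5-second interval are arbitrary and not part of SCR:

jsrun -r 1 ./test_api

# Retry the delete until the file stays gone, in case a delayed
# NFS write-back from rank 0's compute node recreates it.
end=$((SECONDS + 120))
while [ "$SECONDS" -lt "$end" ]; do
    rm -f .scr/halt.scr
    sleep 5
    [ -e .scr/halt.scr ] || break
done

scr_halt --list `pwd`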
