Description
The library (rank 0) writes several files from a compute node, and the SCR run scripts later reference those files from the job script. When using NFS, client-side caching can lead to strange behavior. For example, consider the following sequence of commands executing in a job script:
jsrun -r 1 ./test_api (rank 0 writes to .scr/halt.scr from a compute node)
rm -f .scr/halt.scr
scr_halt --list `pwd` (attempts to read .scr/halt.scr)
When SCR_Finalize() is called, rank 0 in test_api writes an entry to the .scr/halt.scr file from the compute node where rank 0 runs, indicating SCR_FINALIZE_CALLED. The subsequent rm command should remove the halt file after the run completes, so that the following scr_halt command should not find it.
However, on some systems scr_halt does find .scr/halt.scr in the state that rank 0 left it. My best guess is that this happens because the NFS client on the compute node where rank 0 runs flushes its state to the NFS server only after the file has been deleted by the rm command executed on the node that runs the job script.
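One way to check whether this is what is happening is to watch the halt file from the launch node after removing it. The following is only a rough diagnostic sketch (the repeat count and interval are arbitrary, not values the SCR scripts use):
jsrun -r 1 ./test_api
rm -f .scr/halt.scr
# check periodically whether the halt file comes back into view on the launch node,
# which would suggest the rank 0 NFS client flushed its write after the delete
for i in 1 2 3 4 5 6; do
  date
  ls -l .scr/halt.scr 2>/dev/null && cat .scr/halt.scr
  sleep 10
done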
One can often work around this problem by adding a sleep, e.g.:
jsrun -r 1 ./test_api
sleep 60
rm -f .scr/halt.scr
scr_halt --list `pwd`
That sleep must wait long enough for the NFS cache timeout to expire on the rank 0 compute node.
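When the cache timeout is not known, an alternative is to keep re-removing the halt file until the deletion sticks before calling scr_halt. This is only a sketch, not something the SCR scripts provide:
jsrun -r 1 ./test_api
# retry the removal until the halt file stays gone (up to about two minutes here)
for i in 1 2 3 4 5 6 7 8 9 10 11 12; do
  rm -f .scr/halt.scr
  sleep 10
  [ -e .scr/halt.scr ] || break
done
scr_halt --list `pwd`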