Not sure if this is a bug or something specific to my use case.
When running rail in parallel mode using ipcluster with Slurm, I get a RuntimeError that isofrags.tar.gz does not exist. If I restart from that point, everything finishes cleanly.
If I run rail in parallel on a single node with ipcluster (i.e., local instead of Slurm), everything runs cleanly.
I'm guessing it has something to do with using Slurm. It's probably not your problem; I only bring it up because there is a mention of this in a commit log on the parallel branch (launch sketch below). Please let me know if you have a known fix or a suggestion about what might be going on.
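For context, the two launch modes look roughly like this (a sketch only; exact commands and profile configuration keys vary by IPython.parallel/ipyparallel version):

```
# local: engines on the current node
ipcluster start -n 16

# Slurm: create a parallel profile, point its launchers at Slurm in
# the generated ipcluster_config.py, then start with that profile
ipython profile create --parallel --profile=slurm
ipcluster start --profile=slurm -n 64
```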
Thanks
Justin
Thanks for the bug report! So the error output is exactly `The file isofrags.tar.gz does not exist and thus cannot be cached.`?
Sounds like a race condition; my guess is the archive is written on one node but not yet visible on another because of shared-filesystem lag (e.g., NFS attribute caching). Still somewhat mysterious to me, but in dooplicity/emr_simulator.py, try replacing
```python
if not os.path.isfile(file_or_archive):
    iface.fail(('The file %s does not exist and thus cannot '
                'be cached.') % file_or_archive,
               steps=(job_flow[step_number:]
                      if step_number != 0 else None))
    failed = True
    raise RuntimeError
```
(lines 1422-1427) with something like
```python
# Give the shared filesystem a few seconds to catch up before failing.
# (Make sure `time` is imported at the top of the module.)
retries = 0
while not os.path.isfile(file_or_archive):
    time.sleep(1)
    retries += 1
    if retries > 5:
        break
if not os.path.isfile(file_or_archive):
    iface.fail(('The file %s does not exist and thus cannot '
                'be cached.') % file_or_archive,
               steps=(job_flow[step_number:]
                      if step_number != 0 else None))
    failed = True
    raise RuntimeError
```
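If polling like that fixes it, a slightly more general variant of the same idea is to back off between checks instead of sleeping a fixed second. A sketch only; the `wait_for_file` helper below is hypothetical, not something already in dooplicity:

```python
import os
import time

def wait_for_file(path, max_retries=5, base_delay=1.0):
    """Poll for path, sleeping with exponential backoff between checks.

    Returns True as soon as the file appears, False once retries are
    exhausted.
    """
    for attempt in range(max_retries):
        if os.path.isfile(path):
            return True
        time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
    return os.path.isfile(path)

# At the failure site in emr_simulator.py, the check would become:
# if not wait_for_file(file_or_archive):
#     iface.fail(...)  # same failure path as above
```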