FIX: SLURM plugin polling #2693
nipype/pipeline/plugins/slurm.py
Outdated
resource_monitor=False,
terminal_output='allatonce').run()
return res.runtime.stdout.find(str(taskid)) > -1
except Exception as e:
Based on #1853 (comment), I think this should specifically be RuntimeError?
Otherwise, this looks reasonable.
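For illustration, the narrowing suggested above could look roughly like the sketch below. It is not the code in this PR: `run_squeue` is a hypothetical stand-in for the `squeue` command-line call, and the "Invalid job id" message it checks for is an assumption about what `squeue` prints once a job has completed or terminated.

```python
def is_pending(taskid, run_squeue):
    """Report whether a SLURM job still shows up in the queue.

    `run_squeue` is a hypothetical callable (not part of nipype) that
    returns squeue's stdout for `taskid` and raises RuntimeError when
    the command exits with an error.
    """
    try:
        stdout = run_squeue(taskid)
    except RuntimeError as e:
        # Assumption: squeue complains about an invalid job id once the
        # job has completed or terminated, so it is no longer pending.
        if "Invalid job id" in str(e):
            return False
        # Any other failure (e.g. an overloaded controller) is re-raised
        # instead of being silently swallowed.
        raise
    return str(taskid) in stdout
```

Catching only RuntimeError means unrelated bugs (a TypeError, say) still surface instead of being mistaken for a finished job.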
This looks reasonable. If we don't hear back by the end of the week, I'm okay merging, and we can see if upgrading to 1.1.3 fixes people's SLURM problems.
@dalejn tried to run this on Friday, but it crashed. Here's the error:
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/base.py", line 154, in run
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/base.py", line 462, in _get_result
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/slurm.py", line 74, in _is_pending
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/slurm.py", line 70, in _is_pending
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/interfaces/base/core.py", line 521, in run
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/interfaces/base/core.py", line 1033, in _run_interface
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/interfaces/base/core.py", line 970, in raise_exception
RuntimeError: Command: Standard error:
@agt24 that error looks like your SLURM master is getting overloaded with requests. Is there a surplus of jobs currently queued or running through SLURM? We should still merge this patch in for 1.1.3.
Yes, there are limits on how many jobs a single user can have in the queue. This is likely the cause. @dalejn will try again with max_jobs set to something manageable. Agree that the patch can be merged.
Sounds good. Merging. If there's something else sensible to be done on overload (Better error message? Wait and retry?), please open a new issue.
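As a rough idea of what "wait and retry" could mean here, the sketch below wraps a status check in exponential backoff. Nothing in it comes from nipype: `poll_with_retry`, `check`, and the default retry count and delay are all hypothetical names and values chosen for illustration.

```python
import time


def poll_with_retry(check, retries=3, delay=1.0, sleep=time.sleep):
    """Call check() and retry on RuntimeError with exponential backoff.

    `check` is a hypothetical zero-argument callable (e.g. a wrapped
    squeue query). `sleep` is injectable so the waits can be skipped
    in tests.
    """
    for attempt in range(retries + 1):
        try:
            return check()
        except RuntimeError:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == retries:
                raise
            # Wait delay, 2*delay, 4*delay, ... before the next attempt.
            sleep(delay * 2 ** attempt)
```

Backing off gives an overloaded SLURM controller time to recover rather than adding more requests to the pile, and re-raising after the last attempt keeps a persistent failure visible.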
Summary
Related to #1853.
List of changes proposed in this PR (pull-request)
squeue error when jobid has completed/terminated
Acknowledgment