FIX: SLURM plugin polling #2693

Merged (2 commits) on Sep 17, 2018
@mgxd mgxd commented Sep 4, 2018

Summary

Related to #1853.

List of changes proposed in this PR (pull-request)

  • Catch slurm squeue error when jobid has completed/terminated

Acknowledgment

  • (Mandatory) I acknowledge that this contribution will be available under the Apache 2 license.

@mgxd mgxd added this to the 1.1.3 milestone Sep 4, 2018
mgxd commented Sep 4, 2018

@agt24 @atsuch

could you test your workflow(s) with this branch and let us know if things are running smoothly?

            resource_monitor=False,
            terminal_output='allatonce').run()
        return res.runtime.stdout.find(str(taskid)) > -1
    except Exception as e:

Based on #1853 (comment), I think this should specifically be RuntimeError?

Otherwise, this looks reasonable.
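
For illustration, a minimal sketch of the narrower `except` clause suggested above. This is not the merged patch: `run_squeue` is a hypothetical stand-in for the `squeue -j <taskid>` call, and the "Invalid job id" message text is an assumption about what SLURM reports for a job that has already left the queue.

```python
def is_pending(taskid, run_squeue):
    """Return True if `taskid` still appears in the squeue output.

    `run_squeue` is a hypothetical callable: it takes a job ID, returns
    squeue's stdout as a string, and raises RuntimeError on a nonzero exit.
    """
    try:
        stdout = run_squeue(taskid)
    except RuntimeError as e:
        # Assumed error text for a completed/terminated job.
        if "Invalid job id" in str(e):
            return False
        # Any other failure (for example a socket timeout) is re-raised.
        raise
    return str(taskid) in stdout
```

Catching only `RuntimeError` lets unrelated bugs (a `TypeError`, say) surface immediately instead of being inspected as if they were scheduler errors.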

@effigies

@agt24 @atsuch Have you had a chance to test out this branch?

@effigies effigies left a comment

This looks reasonable. If we don't hear back by the end of the week, I'm okay merging, and we can see if upgrading to 1.1.3 fixes people's SLURM problems.

agt24 commented Sep 17, 2018

@dalejn tried to run this on Friday, but it crashed. Here's the error:
```
Traceback (most recent call last):
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/base.py", line 154, in run
    result = self._get_result(taskid)
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/base.py", line 462, in _get_result
    if self._is_pending(taskid):
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/slurm.py", line 74, in _is_pending
    raise(e)
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/pipeline/plugins/slurm.py", line 70, in _is_pending
    terminal_output='allatonce').run()
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/interfaces/base/core.py", line 521, in run
    runtime = self._run_interface(runtime)
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/interfaces/base/core.py", line 1033, in _run_interface
    self.raise_exception(runtime)
  File "/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype/interfaces/base/core.py", line 970, in raise_exception
    ).format(**runtime.dictcopy()))
RuntimeError: Command:
squeue -j 9448406
Standard output:

Standard error:
slurm_load_jobs error: Socket timed out on send/recv operation
Return code: 1
```

mgxd commented Sep 17, 2018

@agt24 that error looks like your SLURM master is getting overloaded with requests. Is there a surplus of jobs currently queued or running through SLURM?

We should still merge this patch in for 1.1.3.

agt24 commented Sep 17, 2018

Yes, there are limits on how many jobs a single user can have in the queue. This is likely the cause.

@dalejn will try again with max_jobs set to something manageable.

Agree that the patch can be merged.
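
As a rough sketch of the `max_jobs` setting mentioned above: the value 50 is an arbitrary placeholder, and passing `max_jobs` through `plugin_args` reflects my understanding of nipype's distributed plugins rather than anything stated in this thread.

```python
# Hypothetical cap; pick a value below the cluster's per-user queue limit.
plugin_args = {"max_jobs": 50}

# With a nipype Workflow object named `workflow` (not defined here), the
# arguments would be passed along these lines:
# workflow.run(plugin="SLURM", plugin_args=plugin_args)
```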

@effigies

Sounds good. Merging. If there's something else sensible to be done on overload (Better error message? Wait and retry?), please open a new issue.
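
One possible shape for the "wait and retry" idea, offered only as a sketch and not as anything nipype implements in this PR. `check` is a hypothetical zero-argument callable wrapping the `squeue` poll; the "Socket timed out" text is taken from the traceback above, while the retry count and delays are arbitrary assumptions.

```python
import time


def poll_with_retry(check, max_tries=3, delay=1.0, sleep=time.sleep):
    """Call `check()`, retrying with exponential backoff on SLURM timeouts.

    `check` is a hypothetical callable that returns the poll result or
    raises RuntimeError. Only errors mentioning "Socket timed out" are
    retried; anything else propagates immediately.
    """
    for attempt in range(1, max_tries + 1):
        try:
            return check()
        except RuntimeError as e:
            # Give up on unrelated errors or once the attempts are used up.
            if "Socket timed out" not in str(e) or attempt == max_tries:
                raise
            # Wait delay, 2*delay, 4*delay, ... before the next attempt.
            sleep(delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the backoff can be exercised in tests without real waiting.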
