
Req to deal with SLURM socket errors more patiently #2766

Closed
@agt24


Summary

At the end of issue #2693, @effigies noted that the error @dalejn was experiencing was caused by the SLURM master throwing an error when it was polled with squeue, possibly because it was busy. After some further testing, we now believe that the NIH HPC SLURM master will throw this error at least once a day, even with a modest polling interval.

We would like to request a patch so that if NiPype receives any kind of timeout error from squeue (we've seen a few different kinds), it politely waits and tries again.

Actual behavior

RuntimeError: Command:
squeue -j 9448406
Standard output:

Standard error:
slurm_load_jobs error: Socket timed out on send/recv operation
Return code: 1

or

The batch system is not available at the moment.

and NiPype exits

Requested behavior

squeue is busy, will try again

and NiPype does _not_ exit
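
To make the request concrete, here is a rough sketch of the kind of retry loop we have in mind. It is not NiPype's plugin code: `run_squeue`, `poll_with_retry`, `TRANSIENT_MARKERS`, and the `max_tries` / `base_delay` defaults are all hypothetical names and values chosen for illustration, and matching on the two messages quoted above is an assumption about how a transient failure could be recognised.

```python
import subprocess
import time

# Assumption: these two fragments (copied from the errors quoted above) mark
# a transient failure. Other timeout messages would need to be added.
TRANSIENT_MARKERS = (
    "Socket timed out on send/recv operation",
    "The batch system is not available at the moment",
)


def run_squeue(job_id):
    """Run `squeue -j <job_id>` and return (returncode, stdout, stderr)."""
    proc = subprocess.run(
        ["squeue", "-j", str(job_id)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return proc.returncode, proc.stdout, proc.stderr


def poll_with_retry(job_id, runner=run_squeue, max_tries=5,
                    base_delay=10.0, sleep=time.sleep):
    """Poll squeue, waiting and retrying when the failure looks transient.

    `runner` and `sleep` are injectable so the loop can be exercised
    without a SLURM cluster.
    """
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        code, out, err = runner(job_id)
        if code == 0:
            return out
        transient = any(m in err or m in out for m in TRANSIENT_MARKERS)
        if not transient or attempt == max_tries:
            # Unknown error, or we ran out of patience: fail as before.
            raise RuntimeError(
                "squeue -j %s failed (return code %d): %s"
                % (job_id, code, err.strip())
            )
        print("squeue is busy, will try again in %.0f s" % delay)
        sleep(delay)
        delay *= 2  # back off so a busy master is polled less often
```

With something like this, a call such as `poll_with_retry(9448406)` would only raise after several consecutive failures, or immediately on an error that does not look like a timeout. The doubling delay is one possible design choice; a fixed wait would also satisfy the request.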

Platform details:

(NiPypeUpdate) [zhoud4@felix ETPB]$ python -c "import nipype; from pprint import pprint; pprint(nipype.get_info())"
{'commit_hash': 'ec7457c23',
 'commit_source': 'installation',
 'networkx_version': '2.2',
 'nibabel_version': '2.3.1',
 'nipype_version': '1.1.3',
 'numpy_version': '1.15.3',
 'pkg_path': '/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype',
 'scipy_version': '1.1.0',
 'sys_executable': '/data/zhoud4/python/envs/NiPypeUpdate/bin/python',
 'sys_platform': 'linux',
 'sys_version': '3.5.4 | packaged by conda-forge | (default, Aug 10 2017, '
                '01:38:41) \n'
                '[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]',
 'traits_version': '4.6.0'}
(NiPypeUpdate) [zhoud4@felix ETPB]$
(NiPypeUpdate) [zhoud4@biowulf ETPB]$ sinfo -V
slurm 17.02.9
(NiPypeUpdate) [zhoud4@biowulf ETPB]$ 
