Summary
At the end of issue #2693, @effigies noted that the error @dalejn was experiencing was caused by the SLURM master throwing an error when it was polled with squeue, possibly because it was busy. After some further testing, we now believe that the NIH HPC SLURM master will throw this error at least once a day, even with a modest polling interval.
We would like to request a patch so that, if NiPype receives any kind of timeout error from squeue (we've seen a few different kinds), it politely waits and tries again.
Actual behavior
RuntimeError: Command:
squeue -j 9448406
Standard output:
Standard error:
slurm_load_jobs error: Socket timed out on send/recv operation
Return code: 1
or
The batch system is not available at the moment.
and NiPype exits
Requested behavior
squeue is busy, will try again
and NiPype does _not_ exit
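To make the request concrete, here is a minimal sketch of the retry behavior we have in mind. It is not NiPype's actual plugin code. The names `poll_squeue`, `is_transient`, `TRANSIENT_MARKERS`, `max_tries`, and `wait` are hypothetical, and the marker list only contains the two messages quoted in this report; the other timeout variants we have seen would need to be added.

```python
import subprocess
import time

# Hypothetical list of stderr fragments treated as transient. Only the two
# messages quoted in this report are included; this is an assumption, not
# an exhaustive list of SLURM timeout errors.
TRANSIENT_MARKERS = (
    "Socket timed out on send/recv operation",
    "The batch system is not available at the moment",
)


def is_transient(stderr):
    """Return True if stderr looks like a temporary SLURM master failure."""
    return any(marker in stderr for marker in TRANSIENT_MARKERS)


def poll_squeue(job_id, max_tries=5, wait=30.0,
                run=subprocess.run, sleep=time.sleep):
    """Run `squeue -j <job_id>`, waiting and retrying on transient errors.

    `run` and `sleep` are injectable so the logic can be exercised
    without access to a cluster.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    cmd = ["squeue", "-j", str(job_id)]
    for attempt in range(1, max_tries + 1):
        proc = run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                   universal_newlines=True)
        if proc.returncode == 0:
            return proc.stdout
        # Give up on errors we do not recognize, or once retries run out.
        if not is_transient(proc.stderr) or attempt == max_tries:
            raise RuntimeError("squeue failed: " + proc.stderr.strip())
        print("squeue is busy, will try again")
        sleep(wait)
```

With this shape, a call such as `poll_squeue(9448406)` would only raise after five consecutive transient failures or on the first unrecognized error. A fixed wait is used for simplicity; a growing delay between attempts may be kinder to a busy master.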
Platform details:
(NiPypeUpdate) [zhoud4@felix ETPB]$ python -c "import nipype; from pprint import pprint; pprint(nipype.get_info())"
{'commit_hash': 'ec7457c23',
'commit_source': 'installation',
'networkx_version': '2.2',
'nibabel_version': '2.3.1',
'nipype_version': '1.1.3',
'numpy_version': '1.15.3',
'pkg_path': '/data/zhoud4/python/envs/NiPypeUpdate/lib/python3.5/site-packages/nipype',
'scipy_version': '1.1.0',
'sys_executable': '/data/zhoud4/python/envs/NiPypeUpdate/bin/python',
'sys_platform': 'linux',
'sys_version': '3.5.4 | packaged by conda-forge | (default, Aug 10 2017, '
'01:38:41) \n'
'[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]',
'traits_version': '4.6.0'}
(NiPypeUpdate) [zhoud4@felix ETPB]$
(NiPypeUpdate) [zhoud4@biowulf ETPB]$ sinfo -V
slurm 17.02.9
(NiPypeUpdate) [zhoud4@biowulf ETPB]$