ref: slurm's job status checker #1853

Closed · wants to merge 6 commits into from

Conversation

@mgxd (Member) commented Feb 27, 2017

Since finished jobs do not show up in squeue, the CommandLine interface fails on the resulting return code of 1, causing the node to crash.

Traceback: 
Traceback (most recent call last):
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/pipeline/plugins/base.py", line 245, in run
    result = self._get_result(taskid)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/pipeline/plugins/base.py", line 518, in _get_result
    if self._is_pending(taskid):
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/pipeline/plugins/slurm.py", line 65, in _is_pending
    terminal_output='allatonce').run()
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1085, in run
    runtime = self._run_wrapper(runtime)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1728, in _run_wrapper
    runtime = self._run_interface(runtime)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1762, in _run_interface
    self.raise_exception(runtime)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1686, in raise_exception
    **runtime.dictcopy()))
RuntimeError: Command:
squeue -j 6517522
Standard output:

Standard error:
slurm_load_jobs error: Invalid job id specified

Return code: 1
Interface CommandLine failed to run.

Meanwhile...

mg_env[10:59][7.36][-95%]mathiasg@openmind7:logs$ sacct -j 6517522
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
6517522      joiner_no+ om_all_no+     gablab          2  COMPLETED      0:0 
6517522.bat+      batch                gablab          2  COMPLETED      0:0 

@satra (Member) commented Feb 27, 2017

why not just trap the exception and check that? we don't want to use popen separately because we take care of encoding issues inside CommandLine.
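
A minimal sketch of that approach (the helper name is hypothetical, and it assumes the "Invalid job id" message is the only expected failure mode):

from nipype.interfaces.base import CommandLine

def job_in_queue(jobid):
    """Return True if squeue still lists the job, False once it has left the queue."""
    cmd = CommandLine(command='squeue', args='-j %d' % int(jobid),
                      terminal_output='allatonce')
    try:
        res = cmd.run()
    except RuntimeError as err:
        if 'Invalid job id specified' in str(err):
            return False      # squeue no longer knows the job, i.e. it finished
        raise                 # anything else is a real error
    return str(jobid) in res.runtime.stdout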

@oesteban (Contributor) commented:

Our (little) experience with SLURM led us to the same conclusion as @mgxd: it is safer to check sacct to investigate the exit code of a SLURM job (see: https://github.com/poldracklab/crn-app-registration-tool/blob/master/cappat/manager/base.py#L219).

I would suggest using CommandLine as @satra mentioned, but adding a second CommandLine to check the sacct output.
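
A sketch of that two-step check (the helper and the returned-state handling are illustrative; the sacct flags -n, -X, and -o State are standard SLURM options):

from nipype.interfaces.base import CommandLine

def slurm_job_state(jobid):
    """Ask squeue first; if the job has already left the queue, ask sacct."""
    try:
        res = CommandLine(command='squeue',
                          args='-h -j %d -o %%T' % int(jobid),
                          terminal_output='allatonce').run()
        state = res.runtime.stdout.strip()
        if state:
            return state                  # e.g. PENDING or RUNNING
    except RuntimeError:
        pass                              # 'Invalid job id specified': fall through to sacct
    res = CommandLine(command='sacct',
                      args='-n -X -j %d -o State' % int(jobid),
                      terminal_output='allatonce').run()
    return res.runtime.stdout.strip().split('\n')[0].strip()  # e.g. COMPLETED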

@satra (Member) commented Feb 27, 2017

also, if we are going to refactor this, let's bring in the cached calls that the SGE plugin has. it would be nice to not call squeue and sacct that many times in a polling loop.
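
For reference, a minimal caching sketch (an illustration of the idea, not the SGE plugin's actual cache): one squeue call per refresh interval answers every per-job query during that poll.

import time
import getpass

from nipype.interfaces.base import CommandLine

class SqueueCache(object):
    """Cache the set of the current user's active SLURM job ids for a short time."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age       # seconds before the cache is refreshed
        self._stamp = 0.0
        self._active = set()

    def _refresh(self):
        # a single squeue call lists every queued/running job id of this user
        res = CommandLine(command='squeue',
                          args='-h -o %i -u ' + getpass.getuser(),
                          terminal_output='allatonce').run()
        self._active = set(res.runtime.stdout.split())
        self._stamp = time.time()

    def is_queued(self, jobid):
        if time.time() - self._stamp > self.max_age:
            self._refresh()
        return str(jobid) in self._active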

@oesteban (Contributor) commented:

sacct should be called only once: after squeue gives empty output. Agreed on the squeue polling part.

@mgxd changed the title from "rev: slurm's job status checker" to "ref: slurm's job status checker" on Feb 27, 2017
@fabioboh (Contributor) commented:

dear all,

has this issue been solved?

thanks
fabio

@mgxd (Member, Author) commented Nov 20, 2017

@fabioboh no, but this is overdue for a revisit - I'll try to get this ready for a future release

In the meantime, I would recommend using the MultiProc plugin while submitting through SLURM if you are encountering these problems

@mgxd added this to the 0.14.1 milestone on Nov 20, 2017
@mgxd modified the milestones: 0.14.1, future on Dec 19, 2017
@mgxd (Member, Author) commented Jan 16, 2018

reposting the relevant Neurostars post here:

I created a small nipype pipeline for preprocessing, and it works when running with MultiProcess plugin.

However, whenever I try to submit the job to our SLURM cluster, I get IO errors at seemingly random nodes, with the following message:

IOError: Job id (xxxxx) finished or terminated, but results file does not exist after (20.0) seconds. Batch dir contains crashdump file if node raised an exception.

The strange thing is, if I look at the node that supposedly crashed, I can find the output files, and there is no mention of errors or crash in the slurm out file in the batch folder. If I re-submit the workflow, it will randomly crash at some other nodes with the same message.

I’ve set ‘job_finished_timeout’ to 20 sec and ‘poll_sleep_duration’ to 5 sec in the workflow execution configuration, hoping that this would help it see the output of each node, but it does not seem to change the frequency of these crashes.

Can someone suggest what I can try in order to debug this issue?

I am using nipype 0.13.0 with Python 2.7.

Thank you for your help!
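
For reference, those two settings belong to Nipype's execution configuration; a minimal sketch of setting them on a workflow (the workflow name, base directory, values, and sbatch arguments below are only illustrative):

from nipype.pipeline.engine import Workflow

wf = Workflow(name='preproc', base_dir='/scratch/work')   # hypothetical workflow
wf.config['execution'] = {
    'job_finished_timeout': 60,   # wait longer for result files on slow/NFS filesystems
    'poll_sleep_duration': 10,    # seconds between scheduler polls
}
# ... add nodes and connections here ...
# wf.run(plugin='SLURM', plugin_args={'sbatch_args': '--time=2:00:00 --mem=4G'})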

@effigies (Member) commented:

@mgxd Any interest in trying to squeeze this in for 1.0.1, or keep in future?

@mgxd (Member, Author) commented Feb 20, 2018

This has been a weird one to debug because I haven't been able to write a workflow that can consistently reproduce the problem - perhaps SLURM load plays a part in this as well.

I don't think I'll have the time to revisit this by 1.0.1, so we can leave it in future for now

@mgxd closed this on Feb 23, 2018
@atsuch commented Apr 18, 2018

Dear all,

I'm the one who reported the problem on Neurostars, and I've come back to this now that I'm working on another pipeline again... I tried both nipype 1.0.1 and 1.0.2, in Python 2 and Python 3 environments, and I still get exactly the same errors on an entirely different pipeline...

Was this problem fixed in nipype 0.14.0 but not in 1.0.1 and up, or..? Let me know if there is anything I can do to check what's going on with the SLURM plugin...!

@mgxd (Member, Author) commented Apr 18, 2018

@atsuch I don't believe this was fixed in 0.14.0.

Could you share a nipype workflow (preferably minimal) that consistently fails for you when using the SLURM plugin?

@atsuch commented Apr 18, 2018

@mgxd, it seems to fail no matter what the pipeline is... but yes, let me move my pipeline to a new GitHub repo so I can share it (right now our pipelines are in a private GitLab repo...).

@atsuch commented Apr 20, 2018

@mgxd, this is a repository containing a direct copy of the pipeline I had trouble with when using the SLURM plugin.

https://github.com/atsuch/nipype_SLURM.git

I'm sorry I couldn't include the example data... but I hope you have some sample image to test it out. On the other hand, if you have something you want me to test on our system, please let me know!

@atsuch commented Jun 22, 2018

@mgxd, our problem was fixed when our administrator created a new scratch directory for running our pipeline. It seems that we were encountering the problems I described earlier whenever the job was run on an NFS-mounted disk.

However, I still get the same error you initially described in this thread:

Standard error:
slurm_load_jobs error: Invalid job id specified

whenever I run >50 subjects at a time. Perhaps because my pipeline (different from the one I posted) has many MapNodes, it spawns lots of jobs, and whenever the count goes above a certain level I get the invalid job id specified error, even though the job has in fact completed successfully.

I'm not sure whether our file system configuration is an unusual one... but anyway, I hope this helps with debugging the problem.

@agt24 commented Sep 4, 2018

We've also been fighting with this issue on the NIH HPC cluster. Are there any updates on the issue @atsuch reported on Jun 22nd?

@satra (Member) commented Sep 4, 2018

@agt24 - is your issue on the cluster also related to nfs mounts? if so, can you check whether increasing the timeouts helps? or is it something else?

@satra (Member) commented Sep 4, 2018

@mgxd - could we add some better SLURM queue-check support, including testing requeued ids (on our cluster)?

@mgxd (Member, Author) commented Sep 4, 2018

This issue wraps a couple of distinct problems, so to outline:

  • Timeouts / incompatibility with some NFS-mounted disks
    • this can be fixed by specifying a different working directory for the workflow
  • Checking pending jobs with squeue and Nipype's CommandLine
    • if the job is not currently running, squeue -j <jobid> exits with the stderr slurm_load_jobs error: Invalid job id specified and a return code of 1. This halts further execution, since CommandLine crashes on any non-zero return code:

      if runtime.returncode is None or \
              runtime.returncode not in correct_return_codes:
          self.raise_exception(runtime)

      We can work around this by instead checking pending jobs through sacct.

I will submit a patch for the latter - once the new engine is in place, we can work on enhancing the plugins
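
A possible shape for that patch, sketched here as a drop-in method for the SLURM plugin class under the assumption that sacct is available on the cluster (the state tuple and the empty-output handling are illustrative choices, not the merged code):

from nipype.interfaces.base import CommandLine

SLURM_ACTIVE_STATES = ('PENDING', 'RUNNING', 'SUSPENDED',
                       'CONFIGURING', 'COMPLETING', 'RESIZING')

def _is_pending(self, taskid):
    """Ask sacct, which still reports finished jobs, instead of squeue, which forgets them."""
    res = CommandLine(command='sacct',
                      args='-n -X -j %d -o State' % int(taskid),
                      terminal_output='allatonce').run()
    state = res.runtime.stdout.strip()
    # an empty answer usually means the accounting record is not visible yet,
    # so keep polling rather than declaring the job finished
    return (not state) or state.startswith(SLURM_ACTIVE_STATES)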

@satra (Member) commented Sep 4, 2018

@mgxd - for the latter, we could allow CommandLine to not crash on a non-zero return. we do that for a few interfaces where a non-zero return code is allowed. we would just have to check the terminal output to see if the specific error was indeed generated.
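
A sketch of that alternative (assuming subclassing CommandLine and widening correct_return_codes is acceptable; the class and helper names are made up):

from nipype.interfaces.base import CommandLine

class SqueueCheck(CommandLine):
    """squeue wrapper that tolerates the 'job not found' exit code."""
    _cmd = 'squeue'

    def _run_interface(self, runtime, correct_return_codes=(0, 1)):
        # accept return code 1 so a vanished job does not crash the interface
        return super(SqueueCheck, self)._run_interface(
            runtime, correct_return_codes=correct_return_codes)

def job_is_pending(jobid):
    res = SqueueCheck(args='-h -j %d' % int(jobid),
                      terminal_output='allatonce').run()
    if 'Invalid job id specified' in (res.runtime.stderr or ''):
        return False              # squeue no longer knows the job: it finished
    return bool(res.runtime.stdout.strip())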

@mgxd deleted the fix/slurm branch on September 4, 2018 19:28
@mgxd mentioned this pull request on Sep 4, 2018
@effigies removed this from the future milestone on Sep 28, 2018