ref: slurm's job status checker #1853

Closed · wants to merge 6 commits into from

Conversation

@mgxd (Member) commented Feb 27, 2017

Since finished jobs do not show up in squeue, the CommandLine interface fails on the resulting return code of 1, causing the node to crash.

Traceback: 
Traceback (most recent call last):
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/pipeline/plugins/base.py", line 245, in run
    result = self._get_result(taskid)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/pipeline/plugins/base.py", line 518, in _get_result
    if self._is_pending(taskid):
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/pipeline/plugins/slurm.py", line 65, in _is_pending
    terminal_output='allatonce').run()
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1085, in run
    runtime = self._run_wrapper(runtime)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1728, in _run_wrapper
    runtime = self._run_interface(runtime)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1762, in _run_interface
    self.raise_exception(runtime)
  File "/om/user/mathiasg/projects/nipypedev/nipype/nipype/interfaces/base.py", line 1686, in raise_exception
    **runtime.dictcopy()))
RuntimeError: Command:
squeue -j 6517522
Standard output:

Standard error:
slurm_load_jobs error: Invalid job id specified

Return code: 1
Interface CommandLine failed to run.

Meanwhile...

mg_env[10:59][7.36][-95%]mathiasg@openmind7:logs$ sacct -j 6517522
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
6517522      joiner_no+ om_all_no+     gablab          2  COMPLETED      0:0 
6517522.bat+      batch                gablab          2  COMPLETED      0:0 

@satra (Member) commented Feb 27, 2017

why not just trap the exception and check that? we don't want to use popen separately because we take care of encoding issues inside CommandLine.
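
A minimal sketch of that approach (the helper name is hypothetical, and it assumes the "Invalid job id" message is the only expected failure mode):

from nipype.interfaces.base import CommandLine

def job_in_queue(jobid):
    """Return True if squeue still lists the job, False once it has left the queue."""
    cmd = CommandLine(command='squeue', args='-j %d' % int(jobid),
                      terminal_output='allatonce')
    try:
        res = cmd.run()
    except RuntimeError as err:
        if 'Invalid job id specified' in str(err):
            return False      # squeue no longer knows the job, i.e. it finished
        raise                 # anything else is a real error
    return str(jobid) in res.runtime.stdout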

@oesteban (Contributor) commented:

Our (little) experience with SLURM led us to the same conclusion as @mgxd: it is safer to check sacct to investigate the exit code of a SLURM job (see: https://github.com/poldracklab/crn-app-registration-tool/blob/master/cappat/manager/base.py#L219).

I would suggest using CommandLine as @satra mentioned, but adding a second CommandLine to check the sacct output.
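
A sketch of that two-step check (the helper and the returned-state handling are illustrative; the sacct flags -n, -X, and -o State are standard SLURM options):

from nipype.interfaces.base import CommandLine

def slurm_job_state(jobid):
    """Ask squeue first; if the job has already left the queue, ask sacct."""
    try:
        res = CommandLine(command='squeue',
                          args='-h -j %d -o %%T' % int(jobid),
                          terminal_output='allatonce').run()
        state = res.runtime.stdout.strip()
        if state:
            return state                  # e.g. PENDING or RUNNING
    except RuntimeError:
        pass                              # 'Invalid job id specified': fall through to sacct
    res = CommandLine(command='sacct',
                      args='-n -X -j %d -o State' % int(jobid),
                      terminal_output='allatonce').run()
    return res.runtime.stdout.strip().split('\n')[0].strip()  # e.g. COMPLETED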

@satra (Member) commented Feb 27, 2017

also, if we are going to refactor this, let's bring in the cached calls that the SGE plugin has. it would be nice to not call squeue and sacct that many times in a polling loop.
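
For reference, a minimal caching sketch (an illustration of the idea, not the SGE plugin's actual cache): one squeue call per refresh interval answers every per-job query during that poll.

import time
import getpass

from nipype.interfaces.base import CommandLine

class SqueueCache(object):
    """Cache the set of the current user's active SLURM job ids for a short time."""

    def __init__(self, max_age=5.0):
        self.max_age = max_age       # seconds before the cache is refreshed
        self._stamp = 0.0
        self._active = set()

    def _refresh(self):
        # a single squeue call lists every queued/running job id of this user
        res = CommandLine(command='squeue',
                          args='-h -o %i -u ' + getpass.getuser(),
                          terminal_output='allatonce').run()
        self._active = set(res.runtime.stdout.split())
        self._stamp = time.time()

    def is_queued(self, jobid):
        if time.time() - self._stamp > self.max_age:
            self._refresh()
        return str(jobid) in self._active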

@oesteban (Contributor) commented:

sacct should be called only once: after squeue gives empty output. Agreed on the squeue polling part.

@mgxd changed the title from "rev: slurm's job status checker" to "ref: slurm's job status checker" on Feb 27, 2017
@fabioboh (Contributor) commented:

dear all,

has this issue been solved?

thanks
fabio

@mgxd (Member, Author) commented Nov 20, 2017

@fabioboh no, but this is overdue for a revisit - I'll try to get this ready for a future release

In the meantime, I would recommend using the MultiProc plugin while submitting through SLURM if you are encountering these problems

@mgxd added this to the 0.14.1 milestone on Nov 20, 2017
@mgxd modified the milestones: 0.14.1, future on Dec 19, 2017
@mgxd (Member, Author) commented Jan 16, 2018

reposting the relevant Neurostars post here:

I created a small nipype pipeline for preprocessing, and it works when running with MultiProcess plugin.

However, whenever I try to submit the job to our SLURM cluster, I get IO errors at seemingly random nodes, with the following message:

IOError: Job id (xxxxx) finished or terminated, but results file does not exist after (20.0) seconds. Batch dir contains crashdump file if node raised an exception.

The strange thing is, if I look at the node that supposedly crashed, I can find the output files, and there is no mention of errors or crash in the slurm out file in the batch folder. If I re-submit the workflow, it will randomly crash at some other nodes with the same message.

I’ve set ‘job_finished_timeout’ to 20 sec and ‘poll_sleep_duration’ to 5 sec in the workflow execution configuration, hoping that this would help it see the output of each node, but it does not seem to change the frequency of these crashes.

Can someone suggest what I can try in order to debug this issue?

I am using nipype 0.13.0 with Python 2.7.

Thank you for your help!
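
For reference, those two settings belong to Nipype's execution configuration; a minimal sketch of setting them on a workflow (the workflow name, base directory, values, and sbatch arguments below are only illustrative):

from nipype.pipeline.engine import Workflow

wf = Workflow(name='preproc', base_dir='/scratch/work')   # hypothetical workflow
wf.config['execution'] = {
    'job_finished_timeout': 60,   # wait longer for result files on slow/NFS filesystems
    'poll_sleep_duration': 10,    # seconds between scheduler polls
}
# ... add nodes and connections here ...
# wf.run(plugin='SLURM', plugin_args={'sbatch_args': '--time=2:00:00 --mem=4G'})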

@effigies (Member) commented:

@mgxd Any interest in trying to squeeze this in for 1.0.1, or keep in future?

@mgxd (Member, Author) commented Feb 20, 2018

This has been a weird one to debug because I haven't been able to write a workflow that can consistently reproduce the problem - perhaps SLURM load plays a part in this as well.

I don't think I'll have the time to revisit this by 1.0.1, so we can leave it in future for now

@mgxd closed this on Feb 23, 2018
@atsuch commented Apr 18, 2018

Dear all,

I'm the one who reported the problem on Neurostars, and I've come back to this now that I'm working on another pipeline again... I tried both nipype 1.0.1 and 1.0.2, in Python 2 and Python 3 environments, and I still get exactly the same errors on an entirely different pipeline...

Was this problem fixed in nipype 0.14.0 but not in 1.0.1 and up, or..? Let me know if there is anything I can do to check what's going on with the SLURM plugin...!

@mgxd (Member, Author) commented Apr 18, 2018

@atsuch I don't believe this was fixed in 0.14.0.

Could you share a nipype workflow (preferably minimal) that consistently fails for you when using the SLURM plugin?

@atsuch commented Apr 18, 2018

@mgxd, it seems to fail no matter what the pipeline is... but yes, let me move my pipeline to a new GitHub repo so I can share it (right now our pipelines are in a private GitLab repo...).

@atsuch commented Apr 20, 2018

@mgxd, this is a repository containing a direct copy of the pipeline I had trouble with when using the SLURM plugin.

https://github.com/atsuch/nipype_SLURM.git

I'm sorry I couldn't include the example data... but I hope you have some sample image to test it out. On the other hand, if you have something you want me to test on our system, please let me know!

@atsuch commented Jun 22, 2018

@mgxd, our problem was fixed when our administrator created a new scratch directory for running our pipeline. It seems that we were encountering the problems I described earlier whenever the job was run on an NFS-mounted disk.

However, I still get the same error you initially described in this thread:

Standard error:
slurm_load_jobs error: Invalid job id specified

whenever I run >50 subjects at a time. Perhaps because my pipeline (different from the one I posted) has many MapNodes, it spawns lots of jobs, and whenever the count goes above a certain level I get the invalid job id specified error, even though the job has in fact completed successfully.

I'm not sure whether our file system configuration is an unusual one... but anyway, I hope this helps with debugging the problem.

@agt24 commented Sep 4, 2018

We've also been fighting with this issue on the NIH HPC cluster. Are there any updates on the issue @atsuch reported on Jun 22nd?

@satra (Member) commented Sep 4, 2018

@agt24 - is your issue on the cluster also related to nfs mounts? if so, can you check whether increasing the timeouts helps? or is it something else?

@satra (Member) commented Sep 4, 2018

@mgxd - could we add some better SLURM queue-check support, including testing requeued ids (on our cluster)?

@mgxd (Member, Author) commented Sep 4, 2018

This issue wraps a couple of distinct problems, so to outline:

  • Timeouts / incompatibility with some NFS-mounted disks
    • this can be fixed by specifying a different working directory for the workflow
  • Checking pending jobs with squeue and Nipype's CommandLine
    • if the job is not currently running, squeue -j <jobid> exits with the stderr slurm_load_jobs error: Invalid job id specified and a return code of 1. This halts further execution, since CommandLine crashes on any non-zero return code:

      if runtime.returncode is None or \
              runtime.returncode not in correct_return_codes:
          self.raise_exception(runtime)

      We can work around this by instead checking pending jobs through sacct.

I will submit a patch for the latter - once the new engine is in place, we can work on enhancing the plugins
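
A possible shape for that patch, sketched here as a drop-in method for the SLURM plugin class under the assumption that sacct is available on the cluster (the state tuple and the empty-output handling are illustrative choices, not the merged code):

from nipype.interfaces.base import CommandLine

SLURM_ACTIVE_STATES = ('PENDING', 'RUNNING', 'SUSPENDED',
                       'CONFIGURING', 'COMPLETING', 'RESIZING')

def _is_pending(self, taskid):
    """Ask sacct, which still reports finished jobs, instead of squeue, which forgets them."""
    res = CommandLine(command='sacct',
                      args='-n -X -j %d -o State' % int(taskid),
                      terminal_output='allatonce').run()
    state = res.runtime.stdout.strip()
    # an empty answer usually means the accounting record is not visible yet,
    # so keep polling rather than declaring the job finished
    return (not state) or state.startswith(SLURM_ACTIVE_STATES)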

@satra (Member) commented Sep 4, 2018

@mgxd - for the latter, we could allow CommandLine to not crash on a non-zero return. we do that for a few interfaces where a non-zero return code is allowed. we would just have to check the terminal output to see if the specific error was indeed generated.
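
A sketch of that alternative (assuming subclassing CommandLine and widening correct_return_codes is acceptable; the class and helper names are made up):

from nipype.interfaces.base import CommandLine

class SqueueCheck(CommandLine):
    """squeue wrapper that tolerates the 'job not found' exit code."""
    _cmd = 'squeue'

    def _run_interface(self, runtime, correct_return_codes=(0, 1)):
        # accept return code 1 so a vanished job does not crash the interface
        return super(SqueueCheck, self)._run_interface(
            runtime, correct_return_codes=correct_return_codes)

def job_is_pending(jobid):
    res = SqueueCheck(args='-h -j %d' % int(jobid),
                      terminal_output='allatonce').run()
    if 'Invalid job id specified' in (res.runtime.stderr or ''):
        return False              # squeue no longer knows the job: it finished
    return bool(res.runtime.stdout.strip())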

@mgxd deleted the fix/slurm branch on September 4, 2018 19:28
@mgxd mentioned this pull request on Sep 4, 2018
@effigies removed this from the future milestone on Sep 28, 2018