ref: slurm's job status checker #1853
Conversation
Why not just trap the exception and check that? We don't want to use Popen separately, because we take care of encoding issues inside `CommandLine`.
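A minimal sketch of the "trap the exception" idea, assuming the checker keeps going through nipype's `CommandLine` (so encoding stays handled there) and that a finished job makes `squeue` exit non-zero with "Invalid job id specified" in its error output; the helper name and the string check are illustrative assumptions:

```python
from nipype.interfaces.base import CommandLine

def job_still_queued(taskid):
    """Illustrative helper: True while SLURM still lists the job in squeue."""
    cmd = CommandLine('squeue', args='-j %s' % taskid,
                      terminal_output='allatonce')
    try:
        result = cmd.run()
    except RuntimeError as err:
        # squeue exits non-zero once a job has left the queue; treat the
        # expected message as "finished" and re-raise anything else.
        if 'Invalid job id specified' in str(err):
            return False
        raise
    return str(taskid) in result.runtime.stdout
```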
Our (little) experience with SLURM led us to the same conclusion as @mgxd: it is safer to check on `sacct`, and I would suggest using it.
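For reference, a hedged sketch of what polling `sacct` could look like through the same `CommandLine` machinery; the `sacct` format string and the set of "still active" states below are assumptions for illustration, not nipype's actual implementation:

```python
from nipype.interfaces.base import CommandLine

# States in which the polling loop should keep waiting (illustrative list).
ACTIVE_STATES = {'PENDING', 'RUNNING', 'SUSPENDED', 'CONFIGURING', 'COMPLETING'}

def is_pending(taskid):
    """Ask sacct, which still reports jobs that have already left squeue."""
    res = CommandLine('sacct', args='-n -X -j %s -o JobID,State' % taskid,
                      terminal_output='allatonce').run()
    for line in res.runtime.stdout.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == str(taskid):
            return fields[1] in ACTIVE_STATES
    # Accounting can lag right after submission, so err on the side of
    # "still pending" rather than crashing the node.
    return True
```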
Also, if we are going to refactor this, let's bring in the cached calls that the SGE plugin has. It would be nice to not call `squeue` and `sacct` that many times in a polling loop.
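The caching idea might look roughly like the sketch below (a hand-rolled illustration, not the SGE plugin's actual code): take one `squeue` snapshot per polling window and answer every per-job check from it. The class name and the 30-second freshness window are made up.

```python
import time

from nipype.interfaces.base import CommandLine

class CachedQueueStatus:
    """Illustrative cache: at most one squeue call per max_age window."""

    def __init__(self, max_age=30.0):
        self._max_age = max_age
        self._stamp = 0.0
        self._job_ids = frozenset()

    def _refresh(self):
        res = CommandLine('squeue', args='--noheader -o %i',
                          terminal_output='allatonce').run()
        self._job_ids = frozenset(
            line.strip() for line in res.runtime.stdout.splitlines())
        self._stamp = time.time()

    def is_pending(self, taskid):
        # However many jobs the polling loop asks about, squeue is run at
        # most once per max_age seconds.
        if time.time() - self._stamp > self._max_age:
            self._refresh()
        return str(taskid) in self._job_ids
```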
Dear all, has this issue been solved? Thanks.
@fabioboh no, but this is overdue for a revisit - I'll try to get this ready for a future release. In the meantime, I would recommend using the
Reposting the relevant Neurostars post here:
@mgxd Any interest in trying to squeeze this in for 1.0.1, or keep it in future?
This has been a weird one to debug because I haven't been able to write a workflow that can consistently reproduce the problem - perhaps SLURM load plays a part in this as well. I don't think I'll have the time to revisit this by 1.0.1, so we can leave it in future for now.
Dear all, I'm the one who reported the problem on Neurostars, and I've come back to this now that I'm working on another pipeline again... I tried both nipype 1.0.1 and 1.0.2 in Python 2 and Python 3 environments, and still get exactly the same errors on an entirely different pipeline... Was this problem fixed in nipype 0.14.0 but not in 1.0.1 and up, or..? Let me know if there is anything I can do to check what's going on with the SLURM plugin...!
@atsuch I don't believe this was fixed in 0.14.0. Could you share a nipype workflow (preferably minimal) that consistently fails for you when using the SLURM plugin?
@mgxd, It seems to fail no matter what the pipeline is... but yes, let me move my pipeline to a new GitHub repo so I can share it (right now our pipes are in a private GitLab repo...).
@mgxd, This is the repository containing a direct copy of the pipeline I had trouble with when using the SLURM plugin: https://github.com/atsuch/nipype_SLURM.git I'm sorry I couldn't include the example data... but I hope you have some sample image to test it out. On the other hand, if you have something you want me to test on our system, please let me know!
@mgxd, Our problem was fixed when our administrator created a new scratch directory for running our pipeline. It seems that we were encountering the problems I described earlier whenever the job was run on an NFS-mounted disk. However, I still get the same error you initially described in the thread (the "Standard error:" block) whenever I run >50 subjects at a time. Perhaps because my pipeline (different from the one I posted) has many MapNodes, it spawns lots of jobs, and it seems that whenever it goes above a certain level, I get the "Invalid job id specified" error, when in fact the job has completed successfully. I'm not sure if our file system configuration is an unusual one... but anyway, I hope it helps in debugging the problem.
We've also been fighting with this issue on the NIH HPC cluster. Are there any updates on the issue @atsuch reported on June 22nd?
@agt24 - is your issue on the cluster also related to NFS mounts? If so, can you check whether increasing the timeouts helps? Or is it something else?
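If the stalls are NFS-related, one existing knob worth mentioning is nipype's `job_finished_timeout` execution setting, which controls how long the engine waits for a node's result file to appear after the scheduler reports the job as done. Whether it helps in this case is an open question; the 60-second value below is only an example:

```python
from nipype import config

# Give slow NFS mounts more time before a "finished" job is treated as
# crashed because its result file has not appeared yet (example value).
config.set('execution', 'job_finished_timeout', '60')

# Equivalently, on a per-workflow basis:
# wf.config['execution']['job_finished_timeout'] = 60
```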
@mgxd - could we add some better SLURM queue check support, including testing requeue ids (on our cluster)?
This issue contains a couple of different issues, so to outline:
I will submit a patch for the latter - once the new engine is in place, we can work on enhancing the plugins.
@mgxd - for the latter, we could allow `CommandLine` to not crash on a non-zero return. We do that for a few interfaces, where a non-zero return code is allowed. We would just have to check the terminal output to see if indeed the specific error was generated.
Since finished jobs do not show up on `squeue`, the `CommandLine` interface returning a return code of 1 would fail, causing the node to crash. Meanwhile...
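A sketch of the "allow the non-zero return, then check the terminal output" approach, assuming recent nipype versions where `CommandLine._run_interface` accepts a `correct_return_codes` argument; the class name and the exact error string are illustrative assumptions:

```python
from nipype.interfaces.base import CommandLine

class TolerantSqueue(CommandLine):
    """Illustrative squeue wrapper: return code 1 is tolerated only when the
    terminal output shows the expected 'job already finished' message."""
    _cmd = 'squeue'

    def _run_interface(self, runtime):
        # Let the base class accept return code 1 instead of raising...
        runtime = super(TolerantSqueue, self)._run_interface(
            runtime, correct_return_codes=(0, 1))
        # ...then keep it only if stderr contains the one error we expect.
        if runtime.returncode == 1 and \
                'Invalid job id specified' not in (runtime.stderr or ''):
            raise RuntimeError('squeue failed unexpectedly:\n%s'
                               % runtime.stderr)
        return runtime

# Usage sketch: result = TolerantSqueue(args='-j 12345').run()
```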