Slurm runs can get so big that we can't list all the job IDs on the sacct command line and crash #5064

Open
adamnovak opened this issue Aug 19, 2024 · 0 comments

adamnovak commented Aug 19, 2024

This was reported in ComparativeGenomicsToolkit/cactus#1465

We attempt to put too many job IDs into a single sacct command, and launching it fails with an "argument list too long" error. That error is not among the ones that prompt us to fall back to scontrol here:

So the net result is that Toil crashes.

We should split the job IDs across multiple sacct commands when there are too many for one. And if anything goes wrong with launching sacct at all (not just when it launches and then fails), we should fall back to scontrol.
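A minimal sketch of how that could look, not Toil's actual code. The names `batch_job_ids`, `query_job_states`, `query_sacct`, `query_scontrol`, and the `MAX_ID_LIST_CHARS` budget are all hypothetical. The two query callables stand in for whatever really runs sacct and scontrol. For simplicity the sketch falls back on any launch failure or nonzero exit, which is broader than the current set of fallback conditions.

```python
import subprocess
from typing import Callable, Dict, Iterator, List, Sequence

# Hypothetical budget for the comma-separated ID list passed to sacct.
# This is an assumption for illustration, not an existing Toil setting.
MAX_ID_LIST_CHARS = 32 * 1024


def batch_job_ids(
    job_ids: Sequence[int], max_chars: int = MAX_ID_LIST_CHARS
) -> Iterator[List[int]]:
    """Yield batches of IDs whose comma-joined form fits in max_chars.

    A single ID longer than max_chars is still yielded on its own.
    """
    batch: List[int] = []
    used = 0
    for job_id in job_ids:
        id_len = len(str(job_id))
        # A comma is needed before every ID except the first in a batch.
        cost = id_len + (1 if batch else 0)
        if batch and used + cost > max_chars:
            yield batch
            batch, used, cost = [], 0, id_len
        batch.append(job_id)
        used += cost
    if batch:
        yield batch


def query_job_states(
    job_ids: Sequence[int],
    query_sacct: Callable[[List[int]], Dict[int, str]],
    query_scontrol: Callable[[Sequence[int]], Dict[int, str]],
    max_chars: int = MAX_ID_LIST_CHARS,
) -> Dict[int, str]:
    """Query sacct one batch at a time; fall back to scontrol on failure."""
    states: Dict[int, str] = {}
    try:
        for batch in batch_job_ids(job_ids, max_chars):
            states.update(query_sacct(batch))
    except (OSError, subprocess.CalledProcessError):
        # OSError covers failures to launch the process at all, such as
        # "Argument list too long" (E2BIG). CalledProcessError covers a
        # launched command that exits nonzero under check=True.
        return query_scontrol(job_ids)
    return states
```

Batching by character count rather than by a fixed number of IDs keeps the command length bounded even when job IDs grow longer over the life of a cluster.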

┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1632
