balance load over nodes #5077
Comments
Is there anything particular about the nodes that are not being assigned? For example, are they partitioned differently, or are they fulfilling a GPU requirement? Or are all nodes created equal? Toil does try to detect the overhead on the machines, but it might not be aware of some intensive background task. I'll look into some of the options for Slurm scaling; that behavior sounds odd to me.
Yes, all nodes and weights are created equal. I'm wondering if LLN=YES as an argument to Slurm is what I want.
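For reference, LLN is a partition-level setting in slurm.conf rather than something a single job submission can turn on, and whether it is enabled can be checked from the command line. A minimal sketch, assuming a partition named batch (a placeholder):

```bash
# Check whether the partition schedules new jobs to the least-loaded node.
# The output of "scontrol show partition" includes an LLN=YES/NO field.
scontrol show partition batch | grep -o 'LLN=[A-Z]*'

# Enabling it is a cluster-admin change in slurm.conf, for example:
#   PartitionName=batch Nodes=node[01-10] LLN=YES ...
# followed by "scontrol reconfigure".
```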
If your Slurm jobs are getting OOM-killed, are you sure that the memory limits assigned to your jobs in Cactus are accurate? If they are too low, I think Slurm should detect that you are trying to go over them and OOM-kill your jobs, even if there is free memory on the node that is not allocated to any job.
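One quick way to check whether the requested limits reflect real usage, assuming Slurm accounting (sacct) is enabled on the cluster; the job ID is a placeholder:

```bash
# Compare what a finished job requested (ReqMem) with its peak usage (MaxRSS).
# MaxRSS approaching or exceeding ReqMem is what triggers the OOM kills.
sacct -j 123456 --format=JobID,JobName%20,ReqMem,MaxRSS,State
```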
If your Toil jobs are large enough, you can add the appropriate sbatch option so that each job gets more of a node to itself, as sketched below.
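A minimal sketch of one way to do that, assuming the suggestion refers to Toil's TOIL_SLURM_ARGS environment variable (which Toil forwards to sbatch) and Slurm's --exclusive flag; the partition name is a placeholder:

```bash
# Extra options appended to every sbatch call Toil makes.
# --exclusive gives each job a whole node, so jobs no longer compete for
# one node's memory; only worthwhile if each job can use most of a node.
export TOIL_SLURM_ARGS="--exclusive --partition=batch"
```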
Original issue description:
Using Toil from within Cactus with a Slurm scheduler.
I have 10 nodes available to me, and each of them has 40 cores and 500 GB RAM. If I submit 100 jobs, Toil will submit the jobs to 3 nodes. In my particular case, this is causing OOM-kill issues. Is there a way to balance the load, i.e. to submit 100 jobs spread evenly over the 10 available nodes?
Thanks in advance for any help available.
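For context on why 100 jobs can land on only 3 nodes: with 40 cores per node and small per-job requests (say one core and a few GB each), Slurm can legitimately pack 30 to 40 jobs onto a single node, and the OOM kills suggest the jobs use more memory than they request. A hedged sketch of a Cactus invocation that sizes the requests to real usage; the 40G per-job figure, paths, and core count are assumptions, not values from this issue:

```bash
# If each job really peaks near 40G, requesting that much makes Slurm's
# packing match reality: 500G / 40G ≈ 12 jobs per node, so 100 jobs
# spread across roughly 9 of the 10 nodes instead of 3.
cactus ./jobstore ./seqFile.txt ./out.hal \
    --batchSystem slurm \
    --defaultCores 4 \
    --defaultMemory 40G
```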
┆Issue is synchronized with this Jira Story
┆Issue Number: TOIL-1638