Add machine defined limits on maximum processors and batch wall clock time #349
Conversation
This addresses Issue #330
The new documentation, largely a table, can be seen here: https://apcraig-cice.readthedocs.io/en/machlim/user_guide/ug_running.html#machine-variables
This looks good to me. I'd like @phil-blain to review and test the changes on his (limited) machine, and I'll try to do the same on badger.
I tried it on brooks and the MAXRUNLENGTH setting works correctly.
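For reference, the MAXRUNLENGTH behavior being tested here (capping a case's requested batch wall time at a machine-defined limit) might be sketched roughly as below. The function and variable names are hypothetical illustrations, not the actual cice.batch code.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: clamp a requested wall time (in minutes) to a
# machine-defined maximum such as ICE_MACHINE_MAXRUNLENGTH. Illustrative only.
cap_runlength() {
  local requested=$1 maxlen=$2
  if (( requested > maxlen )); then
    echo "$maxlen"      # request exceeds the machine limit; clamp it
  else
    echo "$requested"   # request is within the limit; keep it
  fi
}

cap_runlength 120 60   # prints 60
cap_runlength 30 60    # prints 30
```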
Regarding the MAXPES option, I'm not sure I understand yet how everything works regarding blocks, max_blocks and so on, and how cice.setup interacts with the code at the top of cice.batch in that regard. But just looking at the mods in cice.setup in this PR, the first question that came to me was: why do we favor keeping ${thrd} as asked by the user and changing ${task}, and not the opposite? (Or should we allow the user to choose which of ntasks or nthreads should be overwritten if MAXPES < ntasks x nthreads?)
I guess that is what you mean by "adjusting both tasks and threads if the MAXPES is violated".
Also, regarding "adding other machine constraints such as a max threads per task or max threads per node": I think this is a very good idea but maybe not necessary in the context of this PR.
The choice to reduce tasks and increase max blocks while keeping threads the same is largely arbitrary, but I think it's probably the simplest and most robust. In many cases, threads will be 1, so the only option is to reduce tasks. In our test suites, if threads are 2, we really do not want to reduce them to 1, as we'd be turning off threading, which is probably not what we want for that test. Also, threads can be oversubscribed without too much problem, but tasks really cannot. I suppose some additional logic could be added to reduce threads, but we'd need to come up with appropriate rules to do so.
Also, these checks are really designed to overcome setting up test suites which have somewhat arbitrary pe counts to cover various configurations. All we really need is an adjustment to the tasks/threads for a small subset of machines that are resource limited, and for it to work and cover the various configurations. Reducing tasks seems to make sense for that. If someone is setting up a test or a case on their machine and oversubscribing it, and they don't like what the scripts are doing, it's really their fault. Maybe we should have cice.setup abort and force the user to fix it in that case? Do you think that's a useful feature to add?
Anyway, that's my thinking about the current implementation. I think just reducing tasks addresses the issue, and I haven't been able to think of a case that is adversely affected by that approach. But if there are cases where just reducing tasks is a problem, it would be good to take them into account. I guess one would be if the number of tasks is reduced to 1 while threads remain high, but I'm not sure that is a likely scenario. It really depends on the resources available. For machines like travis, with very limited resources, we have a special test suite. If someone is trying to run the full test suite on their laptop, that's another case we might want to think about and generate a special test suite for.
For machines that may have only 16 pes or so, I think the approach taken here is reasonable. Any machine with over 32 or 64 pes is likely to have enough resources. I'm open to other suggestions though.
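The adjustment described above (reduce tasks, keep threads, when tasks x threads exceeds the machine limit) could be sketched as follows. This is a minimal illustration under the assumptions stated in the comments; adjust_pes and its arguments are made-up names, not the actual cice.setup logic.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: if task*thrd exceeds a machine's MAXPES, shrink the
# task count while leaving the thread count unchanged, as discussed above.
# Not the real cice.setup code.
adjust_pes() {
  local task=$1 thrd=$2 maxpes=$3
  if (( task * thrd > maxpes )); then
    task=$(( maxpes / thrd ))    # reduce tasks; threads stay fixed
    (( task < 1 )) && task=1     # never drop below one task
  fi
  echo "$task $thrd"
}

adjust_pes 32 2 16   # prints "8 2" (64 pes requested, capped at 16)
adjust_pes 4 1 16    # prints "4 1" (already within the limit)
```

Reducing tasks rather than threads matches the rationale above: a thread count of 2 in a test is usually there deliberately to exercise threading, while the task count can shrink (with max_blocks absorbing the difference).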
Thanks for the explanation. I agree; the current implementation solves the problem at hand. We can always add more features if they are needed in the future.
Anyone have any objections to merging this now? I will merge tomorrow (Friday, Aug 22) unless I hear otherwise.
PR checklist
Add machine defined limits on maximum processors and batch wall clock time
apcraig
#2253cca91c8bb7e at https://github.com/CICE-Consortium/Test-Results/wiki/cice_by_hash_forks
Two new optional machine env variables were added:
ICE_MACHINE_MAXPES
ICE_MACHINE_MAXRUNLENGTH
This is a relatively quick fix to allow test suites and cases to run on machines with limited processors or batch wall times. These constraints will be imposed on the case as follows:
A full test suite was run on conrad with two compilers. Manual testing was also done on conrad imposing limits on the pes and time limit to make sure the script was working as designed.
There are some other issues that could be tackled such as