-
Notifications
You must be signed in to change notification settings - Fork 30
example/stop_ensemble #458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
* fix file naming bug * fix filename bug with variable as study name * add tests for the file name special vars changes * modify changelog * implement Luc's suggestions * remove replace line
* first attempt at adding pdf * fixing build error * modify changelog to show docs changes * fix errors Luc found in the build logs * trying out removal of latex * reverting latex changes back * uncommenting the latex_elements settings * adding epub to see if latex will build * adding a latex engine variable to conf * fix naming error with latex_engine * attempting to add a logo to the pdf build * testing an override to the searchtools file * revert back to not using searchtools override * update changelog
* fix openfoam_singularity issues * update requirements and descriptions for openfoam examples
* fix bug with output_path and variable substitution * add tests for cli substitutions
* bump version to 1.10.2 * bump version in CHANGELOG
* fix default worker bug with all steps * version bump and requirements fix
* Bump certifi from 2022.12.7 to 2023.7.22 in /docs Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.12.7 to 2023.7.22. - [Commits](certifi/python-certifi@2022.12.07...2023.07.22) --- updated-dependencies: - dependency-name: certifi dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * add all dependabot changes and update CHANGELOG --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* add *.conf to the MANIFEST file so pip will grab the redis.conf file * add note explaining how to fix a hanging merlin server start * modify CHANGELOG * add second export option to docs and fix typo
* Version/1.10.3 (LLNL#445) * fix default worker bug with all steps * version bump and requirements fix * Bugfix/filename-special-vars (LLNL#425) * fix file naming bug * fix filename bug with variable as study name * add tests for the file name special vars changes * modify changelog * implement Luc's suggestions * remove replace line * Create dependabot-changelog-updater.yml * testing outputs of modifying changelog * delete dependabot-changelog-updater * feature/pdf-docs (LLNL#427) * first attempt at adding pdf * fixing build error * modify changelog to show docs changes * fix errors Luc found in the build logs * trying out removal of latex * reverting latex changes back * uncommenting the latex_elements settings * adding epub to see if latex will build * adding a latex engine variable to conf * fix naming error with latex_engine * attempting to add a logo to the pdf build * testing an override to the searchtools file * revert back to not using searchtools override * update changelog * bugfix/openfoam_singularity_issues (LLNL#426) * fix openfoam_singularity issues * update requirements and descriptions for openfoam examples * bugfix/output-path-substitution (LLNL#430) * fix bug with output_path and variable substitution * add tests for cli substitutions * bugfix/scheduler-permission-error (LLNL#436) * Release/1.10.2 (LLNL#437) * bump version to 1.10.2 * bump version in CHANGELOG * resolve develop to main merge issues (LLNL#439) * fix default worker bug with all steps * version bump and requirements fix * dependabot/certifi-requests-pygments (LLNL#441) * Bump certifi from 2022.12.7 to 2023.7.22 in /docs Bumps [certifi](https://github.com/certifi/python-certifi) from 2022.12.7 to 2023.7.22. - [Commits](certifi/python-certifi@2022.12.07...2023.07.22) --- updated-dependencies: - dependency-name: certifi dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * add all dependabot changes and update CHANGELOG --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * bugfix/server-pip-redis-conf (LLNL#443) * add *.conf to the MANIFEST file so pip will grab the redis.conf file * add note explaining how to fix a hanging merlin server start * modify CHANGELOG * add second export option to docs and fix typo * bump to version 1.10.3 (LLNL#444) --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * change hardcoded sphinx requirement * update CHANGELOG --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* fix file naming error for iterative workflows * fixed small bug with new filepath naming * add VLAUNCHER functionality * add docs for VLAUNCHER and modify changelog * re-word docs and fix table format * add a test for vlauncher * run fix-style and add a test for vlauncher * Add the find_vlaunch_var and setup_vlaunch functions. The numeric value of the shell variables may not be defined until run time, so replace with variable strings instead of values. Consolidate the commands into one function. * Add variable set for (t)csh. * Run fix-style * make step settings the defaults and ignore commented lines * add some additional tests * remove regex library import --------- Co-authored-by: Joseph M. Koning <koning1@llnl.gov>
* add patch for skewed sample hierarchy/additional samples * update changelog * catch narrower range of exceptions
* fix typo in batch.py that causes a bug * change print statements to log statements
* begin work on integration refactor; create fixtures and initial tests * update CHANGELOG and run fix-style * add pytest fixtures and README explaining them * add tests to demonstrate how to use the fixtures * move/rename some files and modify integration's README * add password change to redis.pass file * fix lint issues * modify redis pwd for test server to be constant for each test * fix lint issue only caught on github ci
* begin work on integration refactor; create fixtures and initial tests * update CHANGELOG and run fix-style * add pytest fixtures and README explaining them * add tests to demonstrate how to use the fixtures * move/rename some files and modify integration's README * add password change to redis.pass file * fix lint issues * modify redis pwd for test server to be constant for each test * fix lint issue only caught on github ci * add fix for merlin server startup * update CHANGELOG
…elps to alleviate the need to constantly check an ensemble to stop the workers when the workflow finishes. See README file for further details.
…show job <jobid> doesn't show up anymore. This was remedied by checking the specification filename printed at the top of the output file. This also allowed for improved robustness by doing a whole match between the and the WorkDir set in scontrol show job <jobid>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions about this:
How can/should it be modified now that merlin monitor works?
Can/should it use exit $(MERLIN_STOP_WORKERS) to avoid the worker-killed but task not completed issue?
We should note this is only for slurm based systems
Is there an alternative way to do this, eg via modifying $(MERLIN_STOP_WORKERS) logic to cancel a worker’s batch job and any dependencies instead of just the worker? It might be more elegant than an example.
@bgunnar5 what do you think about creating a custom stop worker command that also shuts down a worker’s batch job (and any dependencies)? We could register a custom command that workers can listen for https://docs.celeryq.dev/en/latest/userguide/workers.html#id18 This would mean that $(MERLIN_STOP_WORKERS) could shut down batch jobs too and make it easy for any workflow to shut down its workers and allocation at the end |
@lucpeterson yeah I think ultimately baking this into Merlin rather than having it as a standalone example is probably the way to go. This also means that I could expand it to work for more than just slurm too. I'm not sure if this would need to be an entirely new custom command or if we could just add it on to the existing |
Yeah the stop workers command issues a broadcast call to shutdown workers remotely. If we made a new command that we could broadcast, this would slip in nicely. You could kill workers and batch jobs from either the command line or a workflow. |
@bgunnar5 @lucpeterson, I personally think that having a new command in the form of something along the lines |
@bgunnar5 Let me know if you would like some help in accomplishing this. I can work on this a little at a time if needed. |
Ideally we could be able to do something like We'll need to figure out how to register this custom command based on the scheduler so there will likely be some work to do within the script adapters themselves. Maybe each script adapter can have its own If you have time to work on it then I definitely won't stop you! |
@bgunnar5 Thanks for pointing me to the right locations of where |
A new example for having the workflow itself stop the workers assigned to it.