Skip to content

Draft a slurm test#704

Merged
liamhuber merged 5 commits intoexecutorfrom
slurm_test
Jul 18, 2025
Merged

Draft a slurm test#704
liamhuber merged 5 commits intoexecutorfrom
slurm_test

Conversation

@liamhuber
Copy link
Member

based on Jan's work in pyiron/executorlib#726

@github-actions
Copy link

Binder 👈 Launch a binder notebook on branch pyiron/pyiron_workflow/slurm_test

@codecov
Copy link

codecov bot commented Jul 17, 2025

Codecov Report

Attention: Patch coverage is 66.66667% with 1 line in your changes missing coverage. Please review.

Project coverage is 92.02%. Comparing base (0182d3a) to head (0175251).

Files with missing lines Patch % Lines
pyiron_workflow/workflow.py 66.66% 1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (66.66%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff              @@
##           executor     #704      +/-   ##
============================================
- Coverage     92.10%   92.02%   -0.09%     
============================================
  Files            34       34              
  Lines          3726     3725       -1     
============================================
- Hits           3432     3428       -4     
- Misses          294      297       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@liamhuber
Copy link
Member Author

 Traceback (most recent call last):
  File "/home/runner/work/pyiron_workflow/pyiron_workflow/cached-miniforge/my-env/lib/python3.12/site-packages/pyiron_snippets/dotdict.py", line 4, in __getattr__
    return self.__getitem__(item)
           ^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'add_done_callback'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/runner/work/pyiron_workflow/pyiron_workflow/tests/cluster/slurm_discovery.py", line 13, in <module>
    wf.run_in_thread()
  File "/home/runner/work/pyiron_workflow/pyiron_workflow/cached-miniforge/my-env/lib/python3.12/site-packages/pyiron_workflow/workflow.py", line 404, in run_in_thread
    f.add_done_callback(lambda _: setattr(self, "executor", None))
    ^^^^^^^^^^^^^^^^^^^
  File "/home/runner/work/pyiron_workflow/pyiron_workflow/cached-miniforge/my-env/lib/python3.12/site-packages/pyiron_snippets/dotdict.py", line 6, in __getattr__
    raise AttributeError(
AttributeError: DotDict object has no attribute 'add_done_callback'

Interesting, the internal run call is resolving without returning a future, it's just giving the output dotdict directly.

@liamhuber
Copy link
Member Author

Interesting, the internal run call is resolving without returning a future, it's just giving the output dotdict directly.

Because it was accessing a cached result. Cf. #707

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>
Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>
Otherwise we run into trouble where it loads saved executor instructions (that already have what it would use anyhow)

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>
Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>
@liamhuber liamhuber marked this pull request as ready for review July 18, 2025 17:17
Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>
@liamhuber liamhuber merged commit f579159 into executor Jul 18, 2025
18 checks passed
@liamhuber liamhuber deleted the slurm_test branch July 18, 2025 17:53
liamhuber added a commit that referenced this pull request Jul 18, 2025
* Wrap executorlib executors

To exploit the new caching interface so that nodes can rely on their running state and lexical path to access previously executed results.

Locally with the SingleNodeExecutor everything is looking good, but that doesn't natively support terminating the process that submit the job. I'd like to play around with this using the SlurmClusterExecutor on the cluster before making further changes.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Be kinder to fstrings

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Remove prints

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Black

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Bump lower bound of executorlib

Since we now depend explicitly on a new feature

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Wrap SlurmClusterExecutor

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Don't re-parse executor tuples

It causes a weird hang that blocks observability.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Exploit lexical path cleaning

And make the expected file independently accessible

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Move cache cleaning into the finally method

And hide it behind its own boolean flag for testing

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Update executorlib syntax

And make the file name fixed and accessible at the class level

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Test the single node executor

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Clean the associated cache subdirectory

The slurm executor populates this with a submission script, etc.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Clean up the slurm stuff too

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Add local file executor

From @jan-janssen in [this comment](pyiron/executorlib#708 (comment))

Co-authored-by: Jan Janssen
Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* lint

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Test local both executors

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Add prefix to cleaning directory

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Use test executor

The local file executor got directly included in executorlib as a testing tool.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Validate executor at assignment

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Delay executor tuple parsing

And always with-execute tuples since there is only ever one instance of this executor. If we have already been assigned an executor _instance_ then we trust the user to be managing its state and submit directly rather than wrapping in a with-clause

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Decrease improvement expectation

Recent changes threw off the balance of times in the first vs second run, so rather compare to what you actually care about: that the second run is bypassing the sleep call.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Use explicit shutdown

Instead of a with-clause. This way the executor is still permitted to release the thread before the job is done, but we still guarantee that executors created by bespoke instructions get shutdown at the end of their one-future lifetime.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Clean up written file

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Don't wait

There was necessarily only the one future, so don't wait at shutdown. This removes the need for accepting the runtime error and prevents the wrapped executorlib executors from hanging indefinitely.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Bump executorlib version

Including the lower bound

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Test application to non-node

And debug the error message

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Validate resources at init too

Since that's the way users will typically interact with this field. I also had to change the inheritance order to make sure we were dealing with the user-facing executor and not the task scheduler, but this doesn't impact the submit loop.

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Test file cleaning

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Test uninterpretable executor setting

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Modify test for coverage

So we pass throught the Runnable._shutdown_executor_callback process

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Decrease lower bound

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Draft a slurm test (#704)

* Test slurm submission

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Don't apply callbacks to cached returns

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Only validate submission-time resources

Otherwise we run into trouble where it loads saved executor instructions (that already have what it would use anyhow)

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Mark module

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

* Test cached result branch

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

---------

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>

---------

Signed-off-by: liamhuber <liamhuber@greyhavensolutions.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant