
[patch] Persistent process #476


Merged
liamhuber merged 19 commits into main from persistent_process on Sep 29, 2024
Conversation

liamhuber (Member) commented Sep 25, 2024

To work with long-duration nodes on executors that survive the shutdown of the parent workflow/node python process (e.g. executorlib using slurm), we need to be able to tell the run paradigm to serialize the results, and to try to load such a serialization if we come back and the node is running.

This introduces new attributes Node.serialize_results to trigger the result serialization, and a private Node._do_clean to let power users (i.e. me writing the unit tests) stop the serialized results (and any empty directories) from getting cleaned up automatically at read-time.

Under the hood, Node now directly implements Runnable.on_run and Runnable.run_args leveraging the new detached path from #457 to make sure that each run has access to a semantically relevant path for writing the temporary output file (using cloudpickle). Child classes of Node implement new abstract methods Node._on_run and Node._run_args in place of the previous Runnable abstract methods they implemented.
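
As a rough sketch of how that split can look (this is not the actual pyiron_workflow code; `running`, `serialize_result`, `_do_clean`, and the `_temporary_result_file` helper are simplified stand-ins for the real attributes and the detached-path machinery):

```python
# Minimal sketch only -- the real Node implementation differs in detail.
import os
from abc import ABC, abstractmethod

import cloudpickle


class RunnableSketch(ABC):
    serialize_result: bool = False  # opt in to writing the result to disk
    _do_clean: bool = True  # power-user switch: keep the file after reading
    running: bool = False  # assumed status flag set by the run paradigm

    @abstractmethod
    def _on_run(self, *args, **kwargs):
        """What child classes now implement instead of `on_run`."""

    def _temporary_result_file(self) -> str:
        # Hypothetical helper: a semantically relevant, node-specific path
        return os.path.join("run_data", "result.cpkl")

    def on_run(self, *args, **kwargs):
        path = self._temporary_result_file()
        if self.running and os.path.isfile(path):
            # Coming back to a node that already finished elsewhere:
            # load the serialized result instead of re-running
            with open(path, "rb") as f:
                result = cloudpickle.load(f)
            if self._do_clean:
                os.remove(path)  # tidy up the temporary file
        else:
            result = self._on_run(*args, **kwargs)
            if self.serialize_result:
                os.makedirs(os.path.dirname(path), exist_ok=True)
                with open(path, "wb") as f:
                    cloudpickle.dump(result, f)
        return result
```

The real class additionally implements `Runnable.run_args` via the new abstract `Node._run_args` and cleans up any empty directories it created.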

Still needs work with saving and reloading the parent node. E.g. it will presumably also be re-loaded in the "running" state, but we'd like it to be pretty easy to keep going -- maybe Composite.resume()?

EDIT:

Instead of manually saving a checkpoint, this now just leans on the recovery file getting written when the parent python process gets shut down. There's also no hand-holding around updating the failed status or cache usage of such shut-down nodes.
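
For orientation, the recovery-on-shutdown pattern can be sketched generically like this (a minimal illustration assuming a cloudpickle-able workflow object and a hypothetical file name; it is not the actual pyiron_workflow hook, and it only fires on a clean interpreter exit):

```python
# Generic sketch of "write a recovery file when the parent process shuts down";
# not the pyiron_workflow implementation.
import atexit

import cloudpickle


def register_recovery_dump(workflow, path="recovery.cpkl"):
    """Dump `workflow` to `path` on normal interpreter exit (not on SIGKILL)."""

    def _dump():
        with open(path, "wb") as f:
            cloudpickle.dump(workflow, f)

    atexit.register(_dump)


def load_recovery(path="recovery.cpkl"):
    """Reload a workflow dumped at shutdown; nodes that were handed off to a
    surviving executor will still report themselves as running."""
    with open(path, "rb") as f:
        return cloudpickle.load(f)
```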

Overall I'm a little sad about the added layer of indirection where Runnable.on_run is abstract and implemented by Node.on_run to handle the result serialization, then Node._on_run is a new abstract method... but it is more verbose than complex, so I can grumpily live with it.

Tests leverage the flags manually to spoof the behaviour, but a live test in a read-only notebook shows how the basic operation is working.

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks. (Powered by ReviewNB)

liamhuber added the format_black label (trigger the Black formatting bot) Sep 25, 2024
liamhuber mentioned this pull request Sep 25, 2024

Binder 👈 Launch a binder notebook on branch pyiron/pyiron_workflow/persistent_process


codacy-production bot commented Sep 25, 2024

Coverage summary from Codacy

See diff coverage on Codacy

Coverage variation: +0.04% (target: -1.00%)
Diff coverage: 95.95%

Coverage variation details

Commit | Coverable lines | Covered lines | Coverage
Common ancestor commit (e4a066c) | 3319 | 3039 | 91.56%
Head commit (5e2bb99) | 3371 (+52) | 3088 (+49) | 91.60% (+0.04%)

Coverage variation is the difference between the coverage for the head and common ancestor commits of the pull request branch: <coverage of head commit> - <coverage of common ancestor commit>

Diff coverage details

Scope | Coverable lines | Covered lines | Diff coverage
Pull request (#476) | 74 | 71 | 95.95%

Diff coverage is the percentage of lines that are covered by tests out of the coverable lines that the pull request added or modified: <covered lines added or modified>/<coverable lines added or modified> * 100%


coveralls commented Sep 25, 2024

Pull Request Test Coverage Report for Build 11096898017

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 37 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.04%) to 91.605%

Files with Coverage Reduction | New Missed Lines | %
nodes/composite.py | 13 | 92.49%
node.py | 24 | 90.94%

Totals Coverage Status
Change from base Build 11095845803: 0.04%
Covered Lines: 3088
Relevant Lines: 3371

💛 - Coveralls

liamhuber removed the format_black label (trigger the Black formatting bot) Sep 26, 2024
liamhuber mentioned this pull request Sep 26, 2024
liamhuber (Member, Author) commented

When I went to actually test this, I discovered that executorlib (at least how I'm running it!) is killing the slurm jobs when the Executor dies, so they are not persistent after all. It's possible this is merely user error on my end and there's a way to flag the jobs as persistent (corresponding issue pyiron/executorlib#412), and if so this will still be immediately useful.

Otherwise, if there is no handy-dandy persistent-job executor floating around, then there's nothing fundamentally wrong with the work here, but it would be a lot of complication for no immediate benefit and I would rather close it.

liamhuber and others added 9 commits September 26, 2024 15:34
By adding a new flag, having `Node.on_run` directly handle the serialization of results (making `_on_run` and `_run_args` new abstract methods that have the behaviour of the old public methods), and deserialize temporary results instead of running when a node is already running and such results exist
So that the graph gets saved with the serializer in a running state
liamhuber (Member, Author) commented

Ok, it turns out the way forward with executorlib is its FileExecutor and cache module, as shown here: https://github.com/pyiron-dev/remote-executor

This is a complex enough interface that I don't have time to explore it now. I think this is probably still the way forward for submitting an individual node in a graph off to slurm, and I think you still need the last-minute serialization introduced in this PR, so I'm going to leave this open but draft. In the meantime submitting entire graphs to slurm at once is working fine.

It's a bit messier for the filesystem, but for now let's default to keeping the data around
I want this functionality in, but I'm not at all happy with the UI, and don't totally trust edge cases (e.g. input changing under our feet), so let's put it in private for now in anticipation of changes
We'll get a recovery file when we close the parent process anyhow
With a real living example of `Node._serialize_result` working
liamhuber changed the title from "WIP: Persistent process" to "[patch] Persistent process" Sep 29, 2024
liamhuber marked this pull request as ready for review September 29, 2024 23:52
liamhuber (Member, Author) commented

I tried running the pysqa + executorlib.FileExecutor example as written on cmmc and it just hung. This is almost certainly something simple like a dependency mismatch compared to the binder env, but TBH I don't even want to fight with it right now, so instead I took five minutes to make a laughably bad child of concurrent.futures.Executor that runs something independent of the parent python process and used that. The HPC_example.ipynb language has been updated to further clarify that the examples there are proofs-of-concept and not intended as real or long-term interfaces; a real interface should certainly use pysqa and may use FileExecutor.
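
To make "laughably bad but independent of the parent process" concrete, here is a sketch in the same spirit (not the class from the notebook; the file names, polling interval, and the POSIX-only start_new_session detachment are all assumptions of this illustration):

```python
# Illustrative only -- an Executor whose work runs in a detached child
# process and therefore survives the parent interpreter shutting down.
import os
import subprocess
import sys
import threading
import time
from concurrent.futures import Executor, Future

import cloudpickle


class DetachedProcessExecutor(Executor):
    """Run each submission in a detached subprocess, communicating via files."""

    def __init__(self, workdir: str = "detached_jobs"):
        self.workdir = workdir
        os.makedirs(workdir, exist_ok=True)
        self._count = 0

    def submit(self, fn, /, *args, **kwargs) -> Future:
        self._count += 1
        task_file = os.path.join(self.workdir, f"task_{self._count}.cpkl")
        result_file = os.path.join(self.workdir, f"result_{self._count}.cpkl")
        tmp_file = result_file + ".tmp"

        with open(task_file, "wb") as f:
            cloudpickle.dump((fn, args, kwargs), f)

        # The child loads the task, runs it, and atomically publishes the result;
        # start_new_session detaches it so it keeps going if this process dies.
        runner = "; ".join(
            [
                "import cloudpickle, os",
                f"fn, args, kwargs = cloudpickle.load(open({task_file!r}, 'rb'))",
                f"f = open({tmp_file!r}, 'wb')",
                "cloudpickle.dump(fn(*args, **kwargs), f)",
                "f.close()",
                f"os.replace({tmp_file!r}, {result_file!r})",
            ]
        )
        subprocess.Popen([sys.executable, "-c", runner], start_new_session=True)

        future: Future = Future()

        def _poll():
            # The result file shows up whether or not this parent is still alive
            while not os.path.exists(result_file):
                time.sleep(0.1)
            with open(result_file, "rb") as f:
                future.set_result(cloudpickle.load(f))

        threading.Thread(target=_poll, daemon=True).start()
        return future
```

Anything exposing the concurrent.futures.Executor interface can be slotted in the same way, which is all the proof-of-concept needs.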

liamhuber merged commit c57d387 into main Sep 29, 2024
16 of 17 checks passed
liamhuber deleted the persistent_process branch September 29, 2024 23:55