[patch] Persistent process #476
When I went to actually test this, I discovered that

Otherwise, there is no handy-dandy persistent-job executor floating around, in which case there's nothing fundamentally wrong with the work here, but it would be a lot of complication for no immediate benefit, and I would rather close it.
By adding a new flag, having `Node.on_run` directly handle the serialization of results (making `_on_run` and `_run_args` new abstract methods with the behaviour of the old public methods), and deserializing temporary results instead of running when a node is already running and such results exist
So that the graph gets saved with the serializer in a running state
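The run-or-reload logic described in that commit message could look roughly like the sketch below. This is not the actual implementation from the PR: all names beyond `Node`, `on_run`, and `_on_run` are placeholders, the real code uses cloudpickle rather than stdlib pickle, and the path handling (the "detached path" from #457) is reduced to a label-named directory.

```python
# Hedged sketch of the "serialize on run, reload instead of re-running"
# pattern: if a node is already marked running and a serialized result
# exists on disk, load it instead of redoing the work.
import pickle
from pathlib import Path


class Node:
    def __init__(self, label: str):
        self.label = label
        self.running = False  # set while a run is in flight
        self.serialize_results = False  # opt-in flag, per the PR description

    def _result_file(self) -> Path:
        # The PR derives a semantically relevant path per node;
        # here we just use the label as a directory name.
        return Path(self.label) / "result.pckl"

    def _on_run(self):
        # Child classes implement the actual work here.
        raise NotImplementedError

    def on_run(self):
        f = self._result_file()
        if self.running and f.exists():
            # Coming back to a still-"running" node: load the
            # temporary serialized result instead of re-running.
            with f.open("rb") as fh:
                return pickle.load(fh)
        result = self._on_run()
        if self.serialize_results:
            # Stash the result so a future process can pick it up.
            f.parent.mkdir(parents=True, exist_ok=True)
            with f.open("wb") as fh:
                pickle.dump(result, fh)
        return result
```

The point of the flag is that serialization only happens when requested, so ordinary in-process runs pay no I/O cost.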
Force-pushed from 827c7ee to ac0a75b
Ok, it turns out the way forward with

This is a complex enough interface that I don't have time to explore it now. I think this is probably still the way forward for submitting an individual node in a graph off to slurm, and I think you still need the last-minute serialization introduced in this PR, so I'm going to leave this open but draft. In the meantime, submitting entire graphs to slurm at once is working fine.
It's a bit messier for the filesystem, but for now let's default to keeping the data around
I want this functionality in, but I'm not at all happy with the UI, and I don't totally trust the edge cases (e.g. input changing under our feet), so let's keep it private for now in anticipation of changes
We'll get a recovery file when we close the parent process anyhow
With a real living example of `Node._serialize_result` working
I tried running the
To work with long-duration nodes on executors that survive the shutdown of the parent workflow/node python process (e.g. `executorlib` using slurm), we need to be able to tell the run paradigm to serialize the results, and to try to load such a serialization if we come back and the node is running.

This introduces new attributes: `Node.serialize_results` to trigger the result serialization, and a private `Node._do_clean` to let power users (i.e. me writing the unit tests) stop the serialized results (and any empty directories) from getting cleaned up automatically at read-time.

Under the hood, `Node` now directly implements `Runnable.on_run` and `Runnable.run_args`, leveraging the new detached path from #457 to make sure that each run has access to a semantically relevant path for writing the temporary output file (using cloudpickle). Child classes of `Node` implement the new abstract methods `Node._on_run` and `Node._run_args` in place of the previous `Runnable` abstract methods they implemented.

Still needs work on saving and reloading the parent node. E.g. it will presumably also be re-loaded in the "running" state, but we'd like it to be pretty easy to keep going -- maybe `Composite.resume()`?

EDIT:
Instead of manually saving a checkpoint, this now just leans on the recovery file that gets written when the parent python process is shut down. There's also no hand-holding around updating the failed status or cache usage of such shut-down nodes.

Overall I'm a little sad about the added layer of misdirection where `Runnable.on_run` is abstract and implemented by `Node.on_run` to handle the result serialization, and then `Node._on_run` is a new abstract method... but it is verbose more than complex, so I can grumpily live with it. Tests leverage the flags manually to spoof the behaviour, but a live test in a read-only notebook shows how the basic operation works.