Improve resume suggestions for SequentialRunner #3026
Conversation
- consider parameters persistent
- make --from-node suggestions less conservative
- extend tests

Signed-off-by: Ondrej Zacha <ondrej.zacha@okra.ai>
@ondrejzacha Thanks a lot for the PR! I'm glad that someone is using this for an interactive workflow and gave it a try. Please let me know if you have more suggestions. IMO the session constraint is sometimes too rigid for interactive purposes, but we need to gather more feedback before we come up with any implementation. Sorry for the slow response; at the moment the team is very busy with something else. I had a quick look at the PR. It makes sense and I agree parameters should be treated as persistent. I'd suggest not changing the original test but adding a new one instead: it makes it easier to understand what's failing and helps prevent regressions in the future.
Signed-off-by: Ondrej Zacha <ondrej.zacha@okra.ai>
Makes sense! I kept the original test, with a comment – it adds some duplication but hopefully it should help with consistency.
Thank you for opening this PR @ondrejzacha! On first review the implementation looks good. I do need some more time to think about this though. I feel like the assumption that parameters are not changed makes sense, but I wonder if there are cases where they actually do change and what consequences this change would have for those types of pipelines.

I'd be interested to hear what @idanov and @deepyaman have to say about this, because they've known Kedro for longer.
…jzacha/kedro into improve-resume-suggestions
Hi Kedro team, this PR has been open for a while; is this something you'd like to see merged? My team has successfully been using this adjustment (in the form of a hook) and it has saved us hours of pipeline runtime – but I understand it may potentially affect other use cases negatively. Let me know if there's interest; I'm happy to adjust or add more tests if needed.
@ondrejzacha I personally think this is a good change, and it is a step towards more efficient pipeline re-runs beyond a single session run. I do find that the leaky definition of `persistent_input` is a bit weird, and as a result in this PR you have to use something like `impersistent`.
Is there a different terminology I could use? Or perhaps I could refer directly to "catalogued" datasets and "parameters"?
# Important difference vs. Kedro approach
if node_input.startswith("params:"):
    continue
if isinstance(catalog_datasets[node_input], MemoryDataset):
Can we use `catalog.load` instead of `catalog._datasets`? If I understand correctly, you want to preserve `params:`, and `catalog.load` should work.
Is it necessary to load the dataset though? Isn't that a bit of overkill?
In this approach I only need to check the dataset instance, not its actual data contents.
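A minimal sketch of the distinction being discussed, assuming the private `_datasets` mapping used in the diff above; the helper name `_is_impersistent` is hypothetical:

```python
from kedro.io import DataCatalog, MemoryDataset

def _is_impersistent(catalog: DataCatalog, name: str) -> bool:
    # Type check on the registered dataset object: no I/O, and it works even
    # if the dataset has never been saved during this run.
    return isinstance(catalog._datasets[name], MemoryDataset)

# By contrast, catalog.load(name) would materialise the data (reading from
# disk for persistent datasets, raising for unsaved MemoryDatasets) just to
# answer a yes/no question about persistence.
```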
Apologies for the delay in addressing your PR @ondrejzacha. There's been a lot going on and it slipped my mind. Let's try to get it merged soon now 🙂

I've left some small comments and questions. But I also wanted to discuss what @noklam said:

> I do find that the leaky definition of `persistent_input` is a bit weird and as a result in this PR you have to use something like `impersistent`.

@noklam Do you mean here that you don't like the terminology of the word `impersistent`, or do you think the implementation itself is leaky and should be changed?

If it's about terminology, I do agree that `impersistent` is a tricky word. It's of course the correct opposite of `persistent`, but for a non-native speaker (me) it makes the code harder to read for some reason. Maybe just call it `non_persistent`? 😅 In any case I think it would be helpful to clarify what we mean by `impersistent`/`non_persistent` in the context of Kedro. From reading the code (specifically `_enumerate_impersistent_inputs`), I interpret the definition as being anything that is a `MemoryDataset`, even if it's defined in the catalog.
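A small illustrative example of that reading of the definition (not part of this PR's diff; the `CSVDataset` import is assumed to be available from `kedro-datasets`):

```python
from kedro.io import DataCatalog, MemoryDataset
from kedro_datasets.pandas import CSVDataset  # assumed available

catalog = DataCatalog(
    {
        "intermediate": MemoryDataset(),  # explicitly declared, yet in-memory
        "model_input": CSVDataset(filepath="data/model_input.csv"),
    }
)

# Under the interpretation above, "intermediate" counts as impersistent even
# though it appears in the catalog, while "model_input" does not.
for name, dataset in catalog._datasets.items():
    kind = "impersistent" if isinstance(dataset, MemoryDataset) else "persistent"
    print(name, kind)
```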
""" | ||
# We use _data_sets because they pertain parameter name format | ||
catalog_datasets = catalog._datasets |
Just to clarify, does this mean that `catalog.datasets` (the public attribute) does not contain the parameter name format?
> @noklam Do you mean here you don't like the terminology of the word impersistent or do you think the implementation itself is leaky and should be changed?
>
> If it's about terminology, I do agree that impersistent is a tricky word. It's of course the correct opposite of persistent, but for a non-native speaker (me) it makes it harder to read the code for some reason. Maybe just call it non_persistent? 😅 In any case I think it would be helpful to clarify what we mean with impersistent/non_persistent in the context of Kedro. From reading the code (specifically _enumerate_impersistent_inputs), I interpret the definition being anything that is a MemoryDataset. Even if it's defined in the catalog.

@merelcht Mostly terminology; it's just hard to remember, as previously it was just `MemoryDataset` and now it's basically `MemoryDataset` + params.

New thought: #3520 - any chance we can utilise the new flag that we added? Can we load `parameters` as a "special" kind of `MemoryDataset` that will have `self._EPHEMERAL = False`? We can take this PR first and refactor this afterwards if needed.
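A rough sketch of what that refactor could look like, assuming the `_EPHEMERAL` attribute discussed in #3520; the `ParametersDataset` name is purely illustrative:

```python
from kedro.io import MemoryDataset

class ParametersDataset(MemoryDataset):
    """In-memory dataset for parameter entries that should not be treated as
    ephemeral, so resume suggestions keep considering them persistent."""

    def __init__(self, data):
        super().__init__(data=data)
        self._EPHEMERAL = False  # assumed flag from #3520

# The resume logic could then check the flag instead of special-casing the
# "params:" name prefix, e.g.:
#     if getattr(dataset, "_EPHEMERAL", True): ...
```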
`catalog.datasets` members need to be accessible as attributes, so colons and periods are replaced with `__`. I agree this is very far from an elegant solution, so I'm definitely open to suggestions.
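To illustrate the naming difference being discussed (behaviour as observed around the Kedro versions in this thread; exact output not guaranteed):

```python
from kedro.io import DataCatalog, MemoryDataset

catalog = DataCatalog({"params:model_options": MemoryDataset({"test_size": 0.2})})

# The private mapping keeps the original dataset names...
print(list(catalog._datasets))                 # ['params:model_options']
# ...while the public attribute-style container mangles non-word characters.
print(catalog.datasets.params__model_options)  # attribute access; ':' -> '__'
```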
def _enumerate_impersistent_inputs(node: Node, catalog: DataCatalog) -> set[str]:
    """Enumerate impersistent input Datasets of a ``Node``.
"""Enumerate impersistent input Datasets of a ``Node``. | |
"""Enumerate impersistent input datasets of a ``Node``. |
def _enumerate_parents(pipeline: Pipeline, child: Node) -> list[Node]:
    """For a given ``Node``, returns a list containing the direct parents
    of that ``Node`` in the given ``Pipeline``.

return missing_inputs
Maybe call this `impersistent_inputs` instead? It's not immediately clear that impersistent and missing are meant to be the same thing.
catalog_datasets = catalog._datasets
missing_inputs: set[str] = set()
for node_input in node.inputs:
    # Important difference vs. Kedro approach
Don't forget to remove this comment, or otherwise clarify that parameters are treated as persistent.
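One possible wording for that clarification (just a suggestion, not the wording used in the PR):

```python
if node_input.startswith("params:"):
    # Unlike catalogued datasets, parameters are reloaded from config on
    # every run, so they are treated as persistent and never force
    # re-running upstream nodes.
    continue
```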
Hi both @noklam, @merelcht – thanks for the renewed interest in this PR! I wasn't sure if there was still an intention to merge it, so I haven't pushed any changes here, but I did experiment a bit on the side and came up with what I believe is a better solution: see diff here. It's a slightly more elaborate approach which separates the logic into a separate function. On one hand it's a larger change, on the other it's easier to test; overall I think it adds several benefits over the current approach.

Do you think this is a good direction? I could also create a new PR and make sure all commits are properly signed etc.
Oh amazing, thanks @ondrejzacha! The approach definitely sounds very well thought out 🙂 I'll need a bit of time to review, but I'll prioritise this for next week. Thanks for your patience 🙏
Opened a new PR here: #3719
Description
Fixes #3002
(considers parameters as persistent inputs when calculating which nodes a crashed pipeline run can be resumed from)
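Hedged sketch of the check this PR adjusts (function and variable names are illustrative, not the exact Kedro internals): when working out where a failed run can be resumed from, `params:` inputs no longer count as missing, so the suggested resume point stays closer to the failed node.

```python
from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline.node import Node

def _impersistent_inputs(node: Node, catalog: DataCatalog) -> set[str]:
    """Inputs of ``node`` that would not survive a crashed run."""
    impersistent = set()
    for name in node.inputs:
        if name.startswith("params:") or name == "parameters":
            continue  # parameters are re-read from config, hence "persistent"
        dataset = catalog._datasets.get(name)
        # Unregistered datasets default to MemoryDataset at runtime.
        if dataset is None or isinstance(dataset, MemoryDataset):
            impersistent.add(name)
    return impersistent
```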
Development notes
This PR makes three small changes:
- consider parameters as persistent inputs
- make --from-node suggestions less conservative
- extend the tests

I wasn't able to run the tests (neither `make test` nor `make test-no-spark`) on Python 3.9 on a Mac; hoping to see them pass on CI.

This is my first PR at this repo, let me know if I missed anything important!
Checklist
RELEASE.md file