
Improve resume suggestions for SequentialRunner #3026

Closed

Conversation

ondrejzacha
Contributor

Description

Fixes #3002
(considers parameters as persistent inputs when calculating the nodes from which to continue a crashed pipeline)

Development notes

This PR makes three small changes:

  • It adds a special case so that parameter "datasets" ("params:" inputs) are considered persistent (see the sketch after this list).
  • It makes the node selection algorithm less conservative, suggesting fewer nodes to rerun. (I believe I found a small bug: if an input – persistent or not – is shared across all nodes of the pipeline, the suggestion would be to rerun the whole pipeline whenever any node crashes.)
  • It adjusts the related test to be more specific and cover more scenarios.
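
In code, the first two bullets amount to a check along these lines – a sketch rather than the exact diff; it borrows the _enumerate_impersistent_inputs name that appears later in the review, and internals such as catalog._datasets may differ between Kedro versions:

from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline.node import Node


def _enumerate_impersistent_inputs(node: Node, catalog: DataCatalog) -> set[str]:
    """Inputs of ``node`` that would NOT survive a crashed run."""
    impersistent: set[str] = set()
    for node_input in node.inputs:
        # Parameters are rebuilt from config on every run, so treat them as persistent.
        if node_input.startswith("params:") or node_input == "parameters":
            continue
        # Inputs backed by a MemoryDataset (or absent from the catalog, in which case
        # Kedro materialises them in memory at runtime) are lost when the run crashes.
        dataset = catalog._datasets.get(node_input)
        if dataset is None or isinstance(dataset, MemoryDataset):
            impersistent.add(node_input)
    return impersistent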

⚠️ Upon cloning (before any changes), I couldn't get the tests to run successfully (neither make test nor make test-no-spark) on Python 3.9 on a Mac; hoping to see them pass on CI.

This is my first PR in this repo – let me know if I missed anything important!

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

- consider parameters persistent
- make --from-node suggestions less conservative
- extend tests

Signed-off-by: Ondrej Zacha <ondrej.zacha@okra.ai>
astrojuanlu added the 'Community' label (Issue/PR opened by the open-source community) on Sep 14, 2023
@noklam
Contributor

noklam commented Sep 20, 2023

@ondrejzacha Thanks a lot for the PR! I am glad that someone is using this for the interactive workflow and gave it a try. Please let me know if you have more suggestions. IMO the session constraint is sometimes too rigid for interactive purposes, but we need to gather more feedback before we come up with any implementation.

Sorry for the slow response; at the moment the team is very busy with something else. I had a quick look at the PR. It makes sense and I agree parameters should be treated as persisted. I'd suggest not changing the original test but adding a new one: it helps to understand what's failing and helps prevent regressions in the future.

noklam self-requested a review on September 20, 2023, 11:22
@ondrejzacha
Contributor Author

> I'd suggest not changing the original test but adding a new one: it helps to understand what's failing and helps prevent regressions in the future.

Makes sense! I kept the original test and added a comment – it adds some duplication, but hopefully it helps with consistency.

merelcht (Member) left a comment:

Thank you for opening this PR @ondrejzacha! On first review the implementation looks good. I do need some more time to think about this, though. I feel the assumption that parameters don't change makes sense, but I wonder whether there are cases where they actually do change, and what consequences this change would have for those types of pipelines.

I'd be interested to hear what @idanov and @deepyaman have to say about this, because they've known Kedro for longer.

(inline comments on kedro/runner/runner.py, resolved)
merelcht added the labels 'Stage: Technical Design 🎨' (ticket needs to undergo technical design before implementation) and 'TD: should we?' (Tech Design topic to discuss whether we should implement/solve a raised issue) on Jan 3, 2024
@ondrejzacha
Contributor Author

Hi Kedro team, this PR has been open for a while – is this something you'd like to see merged? My team has been using this adjustment successfully (in the form of a hook) and it has saved us hours of pipeline runtime, but I understand it may negatively affect other use cases. Let me know if there's interest – I'm happy to adjust or add more tests if needed.
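
The hook itself isn't shown in this thread. Purely as an illustration of the kind of adjustment being described, here is a rough sketch using Kedro's standard hook specs – the class name is hypothetical, and it deliberately over-approximates by flagging every completed node whose outputs were memory-only:

from kedro.framework.hooks import hook_impl
from kedro.io import MemoryDataset


class ResumeSuggestionHooks:
    """Hypothetical project hook: on pipeline failure, print which completed
    nodes had memory-only outputs and therefore need to be re-run on resume."""

    def __init__(self):
        self._completed = []

    @hook_impl
    def after_node_run(self, node):
        self._completed.append(node)

    @hook_impl
    def on_pipeline_error(self, error, run_params, pipeline, catalog):
        def is_persistent(name: str) -> bool:
            # Parameters and non-memory catalog entries survive a crash.
            if name.startswith("params:") or name == "parameters":
                return True
            dataset = catalog._datasets.get(name)
            return dataset is not None and not isinstance(dataset, MemoryDataset)

        to_rerun = sorted(
            node.name
            for node in self._completed
            if not all(is_persistent(output) for output in node.outputs)
        )
        print(f"To resume, also re-run these completed nodes: {to_rerun}")

Such a hook would be registered via HOOKS = (ResumeSuggestionHooks(),) in the project's settings.py.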

@noklam
Contributor

noklam commented Feb 10, 2024

@ondrejzacha I personally think this is a good change, and it's a step towards more efficient pipeline re-runs beyond a single session run. I do find the leaky definition of persistent_input a bit weird; as a result, in this PR you have to use something like impersistent.

@ondrejzacha
Contributor Author

> I do find the leaky definition of persistent_input a bit weird; as a result, in this PR you have to use something like impersistent.

Is there different terminology I could use? Or should I perhaps refer directly to "catalogued" datasets and "parameters"?

# Important difference vs. Kedro approach
if node_input.startswith("params:"):
continue
if isinstance(catalog_datasets[node_input], MemoryDataset):

Contributor:

Can we use catalog.load instead of catalog._datasets? If I understand correctly, you want to preserve params:, and catalog.load should work.

Member:

Is it necessary to load the dataset though? Isn't that a bit of overkill?

ondrejzacha (PR author):

In this approach I need to check the dataset instance, not its actual data contents.
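
A minimal illustration of the distinction being made here, assuming a 0.19-style DataCatalog:

from kedro.io import DataCatalog, MemoryDataset

catalog = DataCatalog(datasets={"example": MemoryDataset(data=[1, 2, 3])})

catalog.load("example")       # -> [1, 2, 3]: the data itself, which may be expensive to load
catalog._datasets["example"]  # -> the MemoryDataset instance, which the isinstance check needs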

merelcht (Member) left a comment:

Apologies for the delay in addressing your PR @ondrejzacha. There's been a lot going on and it slipped my mind. Let's try to get it merged soon now 🙂

I've left some small comments and questions, but I also wanted to discuss what @noklam said:

> I do find the leaky definition of persistent_input a bit weird; as a result, in this PR you have to use something like impersistent.

@noklam Do you mean that you don't like the terminology of the word impersistent, or do you think the implementation itself is leaky and should be changed?

If it's about terminology, I do agree that impersistent is a tricky word. It's of course the correct opposite of persistent, but for a non-native speaker (me) it somehow makes the code harder to read. Maybe just call it non_persistent? 😅 In any case, I think it would be helpful to clarify what we mean by impersistent/non_persistent in the context of Kedro. From reading the code (specifically _enumerate_impersistent_inputs), I interpret the definition as anything that is a MemoryDataset, even if it's defined in the catalog.
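
To make that concrete – an illustrative catalog, not taken from the thread, with dataset class names assuming a recent kedro-datasets release: under that reading "intermediate" counts as impersistent even though it is declared in the catalog, "raw_data" is persistent, and the parameter becomes persistent thanks to this PR's special case.

from kedro.io import DataCatalog, MemoryDataset
from kedro_datasets.pandas import CSVDataset

catalog = DataCatalog(
    datasets={
        "raw_data": CSVDataset(filepath="data/01_raw/raw.csv"),  # written to disk -> persistent
        "intermediate": MemoryDataset(),  # declared, but memory-only -> impersistent
    },
    feed_dict={"params:threshold": 0.5},  # parameter -> persistent under this PR
)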


"""
# We use _datasets because it retains the parameter name format
catalog_datasets = catalog._datasets

Member:

Just to clarify, does this mean that catalog.datasets (the public attribute) does not retain the parameter name format?

noklam (Contributor):

> @noklam Do you mean that you don't like the terminology of the word impersistent, or do you think the implementation itself is leaky and should be changed?
> […]

@merelcht Mostly terminology – it's just hard to remember, as previously it was just MemoryDataset and now it's basically MemoryDataset + params.

New thought: #3520 – any chance we can utilise the new flag we added there? Could we load parameters as a "special" kind of MemoryDataset that has self._EPHEMERAL = False? We can take this PR first and refactor afterwards if needed.
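
A rough sketch of that idea – a hypothetical refactor, not part of this PR, assuming the _EPHEMERAL attribute from #3520 and that parameter-backed datasets would set it to False:

def _is_persistent_input(name: str, catalog) -> bool:
    """Hypothetical helper relying on the _EPHEMERAL flag instead of
    special-casing "params:" and isinstance checks against MemoryDataset."""
    dataset = catalog._datasets.get(name)
    if dataset is None:
        # Not declared in the catalog -> materialised as a plain MemoryDataset at runtime.
        return False
    # For this single check to also cover parameters, they would need to be loaded
    # into a MemoryDataset variant with _EPHEMERAL = False, as suggested above.
    return not getattr(dataset, "_EPHEMERAL", False)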

ondrejzacha (PR author):

catalog.datasets members need to be accessible as attributes, so colons and periods are replaced with __. I agree this is far from an elegant solution, so I'm definitely open to suggestions.



def _enumerate_impersistent_inputs(node: Node, catalog: DataCatalog) -> set[str]:
"""Enumerate impersistent input Datasets of a ``Node``.

Member:

Suggested change:
- """Enumerate impersistent input Datasets of a ``Node``.
+ """Enumerate impersistent input datasets of a ``Node``.

def _enumerate_parents(pipeline: Pipeline, child: Node) -> list[Node]:
"""For a given ``Node``, returns a list containing the direct parents
of that ``Node`` in the given ``Pipeline``.
return missing_inputs

Member:

Maybe call this impersistent_inputs instead? It's not immediately clear that impersistent and missing are meant to be the same thing.

catalog_datasets = catalog._datasets
missing_inputs: set[str] = set()
for node_input in node.inputs:
# Important difference vs. Kedro approach

Member:

Don't forget to remove this comment, or otherwise clarify that parameters are treated as persistent.


@ondrejzacha
Contributor Author

Hi both @noklam, @merelcht – thanks for the renewed interest in this PR! I wasn't sure whether there was still an intention to merge it, so I haven't pushed any changes here, but I did experiment a bit on the side and came up with what I believe is a better solution: see the diff linked here.

It's a somewhat more elaborate approach that separates the logic into a dedicated function. On one hand it's a larger change; on the other, it's easier to test. Overall it adds these benefits (a rough sketch follows the list):

  • if a dataset (or param) is persistent & shared, it won't keep looking for ancestors
  • it only looks for ancestors producing impersistent inputs
  • it minimises the number of suggested nodes (= a shorter message for the same pipeline)
  • it's testable (and therefore more applicable to other runners too)
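
Roughly, the linked approach amounts to a backwards walk that stops at persistent inputs. A simplified sketch of the description above (not the exact diff; "unfinished" stands for the nodes that had not completed when the run crashed):

from collections import deque

from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline import Pipeline
from kedro.pipeline.node import Node


def _nodes_to_resume(unfinished: set[Node], pipeline: Pipeline, catalog: DataCatalog) -> set[Node]:
    """Starting from the unfinished nodes, follow only inputs that would not
    survive a crash and add the nodes that produce them."""

    def is_persistent(name: str) -> bool:
        if name.startswith("params:") or name == "parameters":
            return True
        dataset = catalog._datasets.get(name)
        return dataset is not None and not isinstance(dataset, MemoryDataset)

    producer = {output: node for node in pipeline.nodes for output in node.outputs}

    to_rerun = set(unfinished)
    queue = deque(unfinished)
    while queue:
        node = queue.popleft()
        for node_input in node.inputs:
            if is_persistent(node_input):
                continue  # persistent (and possibly shared) inputs stop the walk here
            parent = producer.get(node_input)
            if parent is not None and parent not in to_rerun:
                to_rerun.add(parent)
                queue.append(parent)
    return to_rerun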

Do you think this is a good direction? I could also create a new PR and make sure all commits are properly signed etc.

@merelcht
Member

Oh amazing, thanks @ondrejzacha ! The approach definitely sounds very well thought out 🙂 I'll need a bit of time to review, but I'll prioritise this for next week. Thanks for your patience 🙏

ondrejzacha mentioned this pull request on Mar 18, 2024
@ondrejzacha
Contributor Author

Opened a new PR here: #3719
