Improve resume suggestions for SequentialRunner #3026
Conversation
- consider parameters persistent
- make --from-node suggestions less conservative
- extend tests

Signed-off-by: Ondrej Zacha <ondrej.zacha@okra.ai>
@ondrejzacha Thanks a lot for the PR! I'm glad that someone is using this for an interactive workflow and gave it a try. Please let me know if you have more suggestions. IMO the session constraint is sometimes too rigid for interactive purposes, but we need to gather more feedback before we come up with any implementation. Sorry for the slow response; at the moment the team is very busy with something else. I had a quick look at the PR. It makes sense and I agree parameters should be treated as persistent. I'd suggest not changing the original test but adding a new one instead: it makes it easier to understand what's failing and helps prevent regressions in the future.
Signed-off-by: Ondrej Zacha <ondrej.zacha@okra.ai>
Makes sense! I kept the original test, with a comment – it adds some duplication but hopefully it should help with consistency.
Thank you for opening this PR @ondrejzacha! On first review the implementation looks good. I do need some more time to think about this though. I feel like the assumption that parameters are not changed makes sense, but I wonder if there are cases where they actually do change and what consequences this change would have for those types of pipelines.

I'd be interested to hear what @idanov and @deepyaman have to say about this, because they've known Kedro for longer.
…jzacha/kedro into improve-resume-suggestions
Hi Kedro team, this PR has been open for a while; is this something you'd like to see merged? My team has successfully been using this adjustment (in the form of a hook) and it has saved us hours of pipeline runtime – but I understand it may potentially affect other use cases negatively. Let me know if there's interest; I'm happy to adjust or add more tests if needed.
@ondrejzacha I personally think this is a good change, and it is a step towards more efficient pipeline re-runs beyond a single session run. I do find that the leaky definition of `persistent_input` is a bit weird, and as a result in this PR you have to use something like `impersistent`.
Is there a different terminology I could use? Or perhaps I could refer directly to "catalogued" datasets and "parameters"?
# Important difference vs. Kedro approach
if node_input.startswith("params:"):
    continue
if isinstance(catalog_datasets[node_input], MemoryDataset):
Can we use `catalog.load` instead of `catalog._datasets`? If I understand correctly, you want to preserve `params:`, and `catalog.load` should work.
Is it necessary to load the dataset though? Isn't that a bit of overkill?
In this approach I only need to check the dataset instance, not its actual data contents.
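A minimal sketch of the distinction being discussed, assuming the private `_datasets` mapping used in the diff above; the helper name `_is_impersistent` is hypothetical:

```python
from kedro.io import DataCatalog, MemoryDataset

def _is_impersistent(catalog: DataCatalog, name: str) -> bool:
    # Type check on the registered dataset object: no I/O, and it works even
    # if the dataset has never been saved during this run.
    return isinstance(catalog._datasets[name], MemoryDataset)

# By contrast, catalog.load(name) would materialise the data (reading from
# disk for persistent datasets, raising for unsaved MemoryDatasets) just to
# answer a yes/no question about persistence.
```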
Apologies for the delay in addressing your PR @ondrejzacha. There's been a lot going on and it slipped my mind. Let's try to get it merged soon now 🙂

I've left some small comments and questions. But I also wanted to discuss what @noklam said:

> I do find that the leaky definition of `persistent_input` is a bit weird and as a result in this PR you have to use something like `impersistent`.

@noklam Do you mean here that you don't like the terminology of the word `impersistent`, or do you think the implementation itself is leaky and should be changed?

If it's about terminology, I do agree that `impersistent` is a tricky word. It's of course the correct opposite of `persistent`, but for a non-native speaker (me) it makes the code harder to read for some reason. Maybe just call it `non_persistent`? 😅 In any case I think it would be helpful to clarify what we mean by `impersistent`/`non_persistent` in the context of Kedro. From reading the code (specifically `_enumerate_impersistent_inputs`), I interpret the definition as being anything that is a `MemoryDataset`, even if it's defined in the catalog.
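A small illustrative example of that reading of the definition (not part of this PR's diff; the `CSVDataset` import is assumed to be available from `kedro-datasets`):

```python
from kedro.io import DataCatalog, MemoryDataset
from kedro_datasets.pandas import CSVDataset  # assumed available

catalog = DataCatalog(
    {
        "intermediate": MemoryDataset(),  # explicitly declared, yet in-memory
        "model_input": CSVDataset(filepath="data/model_input.csv"),
    }
)

# Under the interpretation above, "intermediate" counts as impersistent even
# though it appears in the catalog, while "model_input" does not.
for name, dataset in catalog._datasets.items():
    kind = "impersistent" if isinstance(dataset, MemoryDataset) else "persistent"
    print(name, kind)
```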
""" | ||
# We use _data_sets because they pertain parameter name format | ||
catalog_datasets = catalog._datasets |
Just to clarify, does this mean that `catalog.datasets` (the public attribute) does not contain the parameter name format?
> @noklam Do you mean here you don't like the terminology of the word impersistent or do you think the implementation itself is leaky and should be changed?
>
> If it's about terminology, I do agree that impersistent is a tricky word. It's of course the correct opposite of persistent, but for a non-native speaker (me) it makes it harder to read the code for some reason. Maybe just call it non_persistent? 😅 In any case I think it would be helpful to clarify what we mean with impersistent/non_persistent in the context of Kedro. From reading the code (specifically _enumerate_impersistent_inputs), I interpret the definition being anything that is a MemoryDataset. Even if it's defined in the catalog.

@merelcht Mostly terminology; it's just hard to remember, as previously it was just `MemoryDataset` and now it's basically `MemoryDataset` + params.

New thought: #3520 - any chance we can utilise the new flag that we added? Can we load `parameters` as a "special" kind of `MemoryDataset` that will have `self._EPHEMERAL = False`? We can take this PR first and refactor this afterwards if needed.
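A rough sketch of what that refactor could look like, assuming the `_EPHEMERAL` attribute discussed in #3520; the `ParametersDataset` name is purely illustrative:

```python
from kedro.io import MemoryDataset

class ParametersDataset(MemoryDataset):
    """In-memory dataset for parameter entries that should not be treated as
    ephemeral, so resume suggestions keep considering them persistent."""

    def __init__(self, data):
        super().__init__(data=data)
        self._EPHEMERAL = False  # assumed flag from #3520

# The resume logic could then check the flag instead of special-casing the
# "params:" name prefix, e.g.:
#     if getattr(dataset, "_EPHEMERAL", True): ...
```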
`catalog.datasets` members need to be accessible as attributes, so colons and periods are replaced with `__`. I agree this is very far from an elegant solution, so I'm definitely open to suggestions.
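To illustrate the naming difference being discussed (behaviour as observed around the Kedro versions in this thread; exact output not guaranteed):

```python
from kedro.io import DataCatalog, MemoryDataset

catalog = DataCatalog({"params:model_options": MemoryDataset({"test_size": 0.2})})

# The private mapping keeps the original dataset names...
print(list(catalog._datasets))                 # ['params:model_options']
# ...while the public attribute-style container mangles non-word characters.
print(catalog.datasets.params__model_options)  # attribute access; ':' -> '__'
```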
def _enumerate_impersistent_inputs(node: Node, catalog: DataCatalog) -> set[str]:
    """Enumerate impersistent input Datasets of a ``Node``.
"""Enumerate impersistent input Datasets of a ``Node``. | |
"""Enumerate impersistent input datasets of a ``Node``. |
def _enumerate_parents(pipeline: Pipeline, child: Node) -> list[Node]:
    """For a given ``Node``, returns a list containing the direct parents
    of that ``Node`` in the given ``Pipeline``.

return missing_inputs
Maybe call this `impersistent_inputs` instead? It's not immediately clear that impersistent and missing are meant to be the same thing.
catalog_datasets = catalog._datasets
missing_inputs: set[str] = set()
for node_input in node.inputs:
    # Important difference vs. Kedro approach
Don't forget to remove this comment, or otherwise clarify that parameters are treated as persistent.
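One possible wording for that clarification (just a suggestion, not the wording used in the PR):

```python
if node_input.startswith("params:"):
    # Unlike catalogued datasets, parameters are reloaded from config on
    # every run, so they are treated as persistent and never force
    # re-running upstream nodes.
    continue
```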
Hi both @noklam, @merelcht – thanks for the renewed interest in this PR! I wasn't sure if there was still an intention to merge it, so I haven't pushed any changes here, but I did experiment a bit on the side and came up with what I believe is a better solution: see diff here. It's a slightly more elaborate approach which separates the logic into a separate function. On one hand it's a larger change, on the other it's easier to test; overall I think it adds several benefits over the current approach.

Do you think this is a good direction? I could also create a new PR and make sure all commits are properly signed etc.
Oh amazing, thanks @ondrejzacha! The approach definitely sounds very well thought out 🙂 I'll need a bit of time to review, but I'll prioritise this for next week. Thanks for your patience 🙏
Opened a new PR here: #3719
Description
Fixes #3002
(considers parameters as persistent inputs when calculating which nodes a crashed pipeline run can be resumed from)
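Hedged sketch of the check this PR adjusts (function and variable names are illustrative, not the exact Kedro internals): when working out where a failed run can be resumed from, `params:` inputs no longer count as missing, so the suggested resume point stays closer to the failed node.

```python
from kedro.io import DataCatalog, MemoryDataset
from kedro.pipeline.node import Node

def _impersistent_inputs(node: Node, catalog: DataCatalog) -> set[str]:
    """Inputs of ``node`` that would not survive a crashed run."""
    impersistent = set()
    for name in node.inputs:
        if name.startswith("params:") or name == "parameters":
            continue  # parameters are re-read from config, hence "persistent"
        dataset = catalog._datasets.get(name)
        # Unregistered datasets default to MemoryDataset at runtime.
        if dataset is None or isinstance(dataset, MemoryDataset):
            impersistent.add(name)
    return impersistent
```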
Development notes
This PR makes three small changes:
- consider parameters as persistent inputs
- make --from-node suggestions less conservative
- extend the tests

I wasn't able to run the tests (neither `make test` nor `make test-no-spark`) on Python 3.9 on a Mac; hoping to see them pass on CI.

This is my first PR at this repo, let me know if I missed anything important!
Checklist
RELEASE.md file