Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve resume pipeline suggestions #3002

Closed
ondrejzacha opened this issue Sep 3, 2023 · 3 comments
Closed

Improve resume pipeline suggestions #3002

ondrejzacha opened this issue Sep 3, 2023 · 3 comments
Assignees
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature

Comments

@ondrejzacha
Copy link
Contributor

ondrejzacha commented Sep 3, 2023

Description

Currently, when the pipeline crashes (using SequentialRunner), a suggestion is provided for a --from-nodes value to resume the pipeline using persisted inputs. However, if pipeline nodes use parameters as input values, they are considered transitory (as they are represented as MemoryDatasets). As a result, the suggested pipelines are often a lot larger than necessary (assuming the parameters do not change between runs).

Consider the following pipeline, assuming first_output, second_output and third_output all defined in the catalog and value defined in a parameters config file:

Pipeline definition

from typing import Dict
from kedro.pipeline import Pipeline, node

def first_func():
    return 1

def second_func(x, y):
    return x + y

def third_func(x, y):
    raise RuntimeError

def register_pipelines() -> Dict[str, Pipeline]:
    pipeline = Pipeline(
        [
            node(
                name="first_node",
                func=first_func,  # ok
                inputs={},
                outputs="first_output",
            ),
            node(
                name="second_node",
                func=second_func,  # ok
                inputs={"x": "first_output", "y": "params:value"},
                outputs="second_output",
            ),
            node(
                name="third_node",
                func=third_func,  # fails
                inputs={"x": "second_output", "y": "params:value"},
                outputs="third_output",
            ),
        ]
    )

    return {
        "__default__": pipeline,
    }

The pipeline crashes at third_node. Because both third_func and second_func use a parameterized input value, the runner suggests rerunning both of them, suggesting re-running the pipeline with --from-nodes "first_node".

Since I don't change the parameters between runs, fixing third_func and re-running third_node only should be enough to finish the full pipeline – so --from-nodes "third_node" would be a desired suggestion here.

If parameters were considered persistent inputs, the suggested resumed pipelines could be smaller – unless this violates any expectations regarding pipeline params?

Context

In my team, we use often parameterized node inputs and our pipelines can run for 1+ hours, so this can have a significant impact – especially for interactive work. I implemented a hook to generate an alternative resume suggestion, but I believe this change could be useful to other people too. I can open a PR if this is a welcome addition.

This feature has been tackled in #1795.

Possible Implementation

  • in kedro.runner.runner._has_persistent_inputs, skip evaluation of parameters.

Possible Alternatives

If the default behavior is desired, this alternative suggestion could also be provided as an additional hook.

@ondrejzacha ondrejzacha added the Issue: Feature Request New feature or improvement to existing feature label Sep 3, 2023
@astrojuanlu astrojuanlu added the Community Issue/PR opened by the open-source community label Sep 3, 2023
@astrojuanlu
Copy link
Member

Thanks a lot for the writeup @ondrejzacha ! It will take us some time to get to this issue because we are prioritizing the 0.19 release, but we will get there.

Provided that we consider it a desirable feature request, would you like to open a pull request for it?

@ondrejzacha
Copy link
Contributor Author

Sure, I'll give it a go!

@merelcht
Copy link
Member

merelcht commented Apr 2, 2024

Completed in #3719

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Community Issue/PR opened by the open-source community Issue: Feature Request New feature or improvement to existing feature
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants