Skip node at runtime #2410

sbrugman · 2023-03-10T16:54:35Z

There currently is no way (that I know of) to skip a node at runtime (e.g. from a hook), without failing the pipeline run.

Is there already an idiomatic way of doing so? For example, building a custom runner with a function similar to run_only_missing?

Alternatives considered:

Overwriting the function with a no-op will still continue with saving the dataset, which should be avoided in that case.
Removing the node might have unintended side effects.
Alternatively, the pipeline could be built dynamically. A downside is that the hooks abstraction then cannot be used to decide whether to skip a node, so this possibly adds a lot of boilerplate/overhead.

If not, would a contribution for this be welcome? It could be a fairly simple and generic addition. (Happy to add)

(related to #2307)


datajoely · 2023-03-10T17:12:03Z

How would you like this to work if it existed? Is it based on a condition or is it known pre-run?

sbrugman · 2023-03-10T17:17:09Z

What would work well in the case above is a way to mark a node as skipped (simply a boolean flag) that can be set in the before_node_run hook. (Might need some extra thinking)

Indeed, it is based on a condition only known at runtime. In the referenced issue this would be a cache hit; however, I can imagine use cases with other conditions.

(Note that GitHub Actions, Azure DevOps pipelines and related tools do support this and could be a source of inspiration.)

antonymilne · 2023-03-14T15:05:23Z

I think there are going to be people who disagree (e.g. @idanov), but personally I like this idea and think doing it with hooks feels very natural. before_node_run already enables some "advanced" behaviour where you can return a dictionary to dynamically override node inputs. We could also add a sentinel value SKIP, and if the hook returns that value then the node's execution is skipped.
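To make the sentinel idea concrete, here is a minimal toy sketch using only the standard library. It does not use Kedro: `SKIP`, `cache_hit_hook` and `run_nodes` are hypothetical names invented for illustration, and the loop only mimics how a runner might react if a before-node hook returned such a sentinel.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical sentinel; nothing like this exists in Kedro's API today.
SKIP = object()


def cache_hit_hook(node_name: str, cached: Set[str]) -> Optional[object]:
    """Stand-in for a before_node_run hook: return SKIP on a cache hit."""
    return SKIP if node_name in cached else None


def run_nodes(
    nodes: Dict[str, Callable[[], str]],
    before_node_run: Callable[[str], Optional[object]],
) -> Dict[str, str]:
    """Toy sequential runner: a skipped node is neither executed nor saved."""
    saved: Dict[str, str] = {}
    for name, func in nodes.items():
        if before_node_run(name) is SKIP:
            continue  # skip both the function call and the save step
        saved[name] = func()
    return saved


nodes = {"clean": lambda: "cleaned", "train": lambda: "model"}
saved = run_nodes(nodes, lambda name: cache_hit_hook(name, cached={"clean"}))
print(saved)  # {'train': 'model'}
```

Comparing with `is SKIP` rather than `==` keeps the sentinel distinct from any ordinary return value, such as a dictionary of overridden inputs.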

Here are three other ideas that are already possible, but I suspect they won't offer the fully flexible dynamic functionality you'd like. They could also be used in combination:

As you suggest, take the code from run_only_missing and use it to define your own custom runner. Put this in <project-name>/src/<python_package>/runner.py and then run kedro run --runner=<python_package>.runner.MissingOnlySequentialRunner.
Use Pipeline.filter in your pipeline_registry.py to register a pipeline skip_nodes and then run it with kedro run -p skip_nodes.
The no-op idea: the key to getting this working, I think, would be to override node.run and not node.func as you might expect. Take a look at https://gist.github.com/mzjp2/076bfd73b0215bda01ee71186966389d and the discussion it came from, "DVC Plugin to skip Nodes if Data and Code are up to date" (#837).
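As a rough illustration of the selection logic a "missing only" runner needs, the sketch below picks the nodes that have at least one missing output, plus every node downstream of them. It is a simplified stand-in written against plain tuples, not Kedro's implementation of run_only_missing; the `Node` type and `nodes_to_run` are hypothetical, and it assumes the nodes are listed in topological order and that every dataset is persisted.

```python
from typing import List, Set, Tuple

# (node name, input datasets, output datasets), listed in topological order.
# A hypothetical stand-in for a pipeline, not a Kedro type.
Node = Tuple[str, Set[str], Set[str]]


def nodes_to_run(nodes: List[Node], existing: Set[str]) -> List[str]:
    """Select nodes with a missing output, plus anything downstream of them."""
    stale: Set[str] = set()  # datasets that will be (re)written in this run
    selected: List[str] = []
    for name, inputs, outputs in nodes:
        has_missing_output = bool(outputs - existing)
        depends_on_rerun = bool(inputs & stale)
        if has_missing_output or depends_on_rerun:
            selected.append(name)
            stale |= outputs
    return selected


pipeline = [
    ("clean", {"raw"}, {"cleaned"}),
    ("features", {"cleaned"}, {"features"}),
    ("train", {"features"}, {"model"}),
]
to_run = nodes_to_run(pipeline, existing={"raw", "cleaned"})
print(to_run)  # ['features', 'train']
```

A real runner would also have to rerun upstream nodes whose outputs are held only in memory, which this sketch ignores.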

Sm1Ling · 2023-11-01T09:20:12Z

Hi!
Has anyone picked up this feature?
Looking forward to it.

Options:

Cache the configs (or hashes of the configs) of upstream nodes and compare them on each launch.
Check whether the node's output dataset already exists (for instance, I name datasets task-wise: different tasks get different dataset names, while the same task keeps the same pipeline settings).
Cache the names of the upstream nodes' source files and their sizes (or other proxies for change), and compare them with the data from the current launch.
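The first two options can be combined into one check, sketched below with the standard library only. The names `config_fingerprint` and `should_skip` are hypothetical, the cache is a plain dictionary standing in for something persisted between launches, and the sketch assumes the config is JSON-serialisable.

```python
import hashlib
import json
from typing import Any, Dict


def config_fingerprint(config: Dict[str, Any]) -> str:
    """Hash a config deterministically (keys sorted so ordering is irrelevant)."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def should_skip(
    node_name: str,
    config: Dict[str, Any],
    cache: Dict[str, str],
    output_exists: bool,
) -> bool:
    """Skip only if the output exists and the config hash matches the last launch."""
    fingerprint = config_fingerprint(config)
    hit = output_exists and cache.get(node_name) == fingerprint
    # Simplification: a real runner would record the fingerprint only after
    # the node has run successfully.
    cache[node_name] = fingerprint
    return hit


cache: Dict[str, str] = {}
first = should_skip("train", {"lr": 0.1}, cache, output_exists=False)
second = should_skip("train", {"lr": 0.1}, cache, output_exists=True)
third = should_skip("train", {"lr": 0.2}, cache, output_exists=True)
print(first, second, third)  # False True False
```

Hashing source files and their sizes, as in the third option, would slot into the same comparison by extending what goes into the fingerprint.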

sbrugman · 2023-11-28T22:26:51Z

Our team is working on a kedro runner for this. PyCodeHash was just released and solves the heavy lifting of hashing functions and datasets.

sbrugman added the Issue: Feature Request New feature or improvement to existing feature label Mar 10, 2023

merelcht added the Community Issue/PR opened by the open-source community label Mar 13, 2023

merelcht added this to Kedro Framework Mar 13, 2023

merelcht mentioned this issue Mar 27, 2023

Automatically skip running nodes with persisted outputs #2307

Open

astrojuanlu mentioned this issue Apr 4, 2024

VOTE: Add Simon Brugman to TSC #3780

Merged

7 tasks

merelcht removed the Community Issue/PR opened by the open-source community label Jul 8, 2024

merelcht added this to the Something about Runners milestone Sep 12, 2024
