Automatically skip running nodes with persisted outputs #2307

jmholzer · 2023-02-10T16:35:14Z

Description

Re-running nodes which have:

- Persisted outputs
- No upstream dependencies which would cause their output to change

is an unnecessary expense. It might be a good idea to have a flag which would automatically skip running these nodes.

It is currently possible to achieve this by specifying nodes to run from, though this process is manual and potentially error-prone.
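
As a rough illustration of the rule described above, the selection could look something like the following. This is a toy sketch and not Kedro's API: the `Node` tuple, `nodes_to_run`, and the `persisted` set are hypothetical names, and the nodes are assumed to be listed in topological order.

```python
from typing import NamedTuple


class Node(NamedTuple):
    """Hypothetical stand-in for a pipeline node (not Kedro's Node class)."""

    name: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def nodes_to_run(nodes: list[Node], persisted: set[str]) -> list[str]:
    """Return the names of nodes that must run.

    Assumes `nodes` is topologically sorted. A node is skipped only if all
    of its outputs are already persisted and none of its inputs will be
    regenerated by an upstream node that has to run.
    """
    regenerated: set[str] = set()
    to_run: list[str] = []
    for node in nodes:
        missing_output = any(out not in persisted for out in node.outputs)
        stale_input = any(inp in regenerated for inp in node.inputs)
        if missing_output or stale_input:
            to_run.append(node.name)
            regenerated.update(node.outputs)
    return to_run


pipeline = [
    Node("load", ("raw",), ("clean",)),
    Node("features", ("clean",), ("features",)),
    Node("train", ("features",), ("model",)),
]

# "clean" and "features" are already persisted; "model" is not.
print(nodes_to_run(pipeline, persisted={"raw", "clean", "features"}))
```

Note that this only checks whether outputs exist. It does not detect changed inputs or changed node code, which is where most of the difficulty lies.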

Context

User @pedro-sarpen opened #2005 to address this issue, though there may be a better solution to the problem that we should investigate.

antonymilne · 2023-02-14T06:30:06Z

This is definitely something we should have, although I don't have any concrete ideas on the best way to do it off the top of my head. The broader question of "change capture" has been discussed before but I don't think anything was properly decided on. Maybe now would be the right time to re-open those discussions.

marcosfelt · 2023-03-02T11:16:39Z

I'd like to add that I'd love this feature. Currently, I have to comment out nodes in my pipelines and add their outputs to the inputs of the pipeline. That's really tedious and seems like an anti-pattern.

sbrugman · 2023-03-10T16:47:11Z

(Our team is working on this and plans to open-source it.)

merelcht · 2023-03-27T13:11:18Z

Linking: #2410

astrojuanlu · 2023-11-26T22:06:34Z

xref change capture #221

To me, the main difficulty is that doing this requires making assumptions about the node functions, in particular that they're pure, i.e. that they don't have any spurious inputs, like randomness, the current date, and so on. If we assume so, then doing some sort of hashing on the inputs is technically sufficient.

As I said in #221, this would make kedro run no longer stateless.
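
To make the hashing idea concrete, here is a minimal sketch, assuming pure node functions and JSON-serialisable inputs. The names `input_fingerprint`, `run_if_changed`, and the `run_state.json` file are all hypothetical and not part of Kedro. The state file is exactly the piece that would make a run stateful.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Callable


def input_fingerprint(inputs: dict[str, Any]) -> str:
    """Hash JSON-serialisable inputs; real datasets would need a content hash."""
    payload = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def run_if_changed(
    name: str,
    func: Callable[..., Any],
    inputs: dict[str, Any],
    state_file: Path,
) -> tuple[bool, Any]:
    """Run `func` only if its inputs differ from the last recorded run.

    Returns (ran, result). Assumes `func` is pure: same inputs, same output.
    Changes to the body of `func` are NOT detected by this sketch.
    """
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    fingerprint = input_fingerprint(inputs)
    if state.get(name) == fingerprint:
        return False, None  # Skipped: the persisted output is assumed valid.
    result = func(**inputs)
    state[name] = fingerprint
    state_file.write_text(json.dumps(state))
    return True, result


with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "run_state.json"
    first, _ = run_if_changed("double", lambda x: x * 2, {"x": 3}, state_file)
    second, _ = run_if_changed("double", lambda x: x * 2, {"x": 3}, state_file)
    third, _ = run_if_changed("double", lambda x: x * 2, {"x": 4}, state_file)

print(first, second, third)
```

The second call is skipped because the fingerprint matches what the first call stored, and the third runs again because the input changed.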

sbrugman · 2023-11-28T21:51:42Z

Related update: our team has just open-sourced pycodehash and is working on a Kedro runner that can skip cached datasets and nodes.

astrojuanlu · 2023-11-29T12:02:05Z

@sbrugman I was having a look at PyCodeHash, looks superb!

One question: what can we do for cases like these?

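import datetime as dt
import polars as pl

def preprocess_data(df: pl.DataFrame) -> pl.DataFrame:
    # Hidden input: the result depends on the wall clock, not just on `df`.
    now = dt.datetime.now()
    if now.minute % 2 == 0:
        raise Exception("boom")
    return df.head()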

These would effectively be cached, am I right?

sbrugman · 2023-11-29T17:55:08Z

This one is not deterministic. The random component (time) should be a parameter/dataset in order for this to work.

(Idempotent pipelines are required)
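
One way to read this suggestion is sketched below: the clock becomes an explicit input, so a hash of the inputs also covers the time. This is only an illustration. The `now` parameter is an assumed name, and a plain list of rows stands in for the DataFrame to keep the example dependency-free.

```python
import datetime as dt


def preprocess_data(rows: list[dict], now: dt.datetime) -> list[dict]:
    """Same logic as the example above, with the timestamp passed in.

    The function is now deterministic: identical `rows` and `now` always
    produce the same result (or the same exception).
    """
    if now.minute % 2 == 0:
        raise Exception("boom")
    return rows[:5]  # Mirrors df.head(), which returns the first 5 rows.


rows = [{"id": i} for i in range(10)]
odd_minute = dt.datetime(2023, 11, 29, 17, 55)
print(len(preprocess_data(rows, odd_minute)))
```

In a pipeline, the timestamp would then come from a parameter or dataset, so a new value invalidates the cached output instead of being silently ignored.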

astrojuanlu · 2024-02-04T19:25:10Z

Closed #221 as a duplicate of this one. The former is older and has some extra context.

astrojuanlu · 2024-02-04T19:26:24Z

Previously: #30, #25, #82.

astrojuanlu · 2024-07-23T08:36:02Z

After showing Kedro to a data scientist, this was the first thing they asked. They were familiar with DVC.

jmholzer added the "Issue: Feature Request" label Feb 10, 2023

jmholzer mentioned this issue Feb 10, 2023

Draft Pull Request : Add incremental run method #2005

Closed

merelcht added this to the Something about Runners milestone Feb 16, 2023

sbrugman mentioned this issue Mar 10, 2023

Skip node at runtime #2410

Open

merelcht added the "Stage: Technical Design 🎨" label Mar 13, 2023

merelcht assigned jmholzer Mar 27, 2023

merelcht unassigned jmholzer Jul 27, 2023

merelcht added the "TD: implementation" label Jul 28, 2023

astrojuanlu mentioned this issue Jan 15, 2024

[Debugging] Show which datasets are outdated kedro-org/kedro-viz#1704

Closed

2 tasks

astrojuanlu mentioned this issue Feb 4, 2024

Incremental runs/"Run only missing" #221

Closed

astrojuanlu mentioned this issue Apr 4, 2024

VOTE: Add Simon Brugman to TSC #3780

Merged

7 tasks
