[spike] Investigate suitability of Kedro for EL pipelines and incremental data loading #3578

astrojuanlu · 2024-01-30T18:13:31Z

Intro and context

Kedro describes itself in its README as a tool for data science and data engineering pipelines (emphasis mine):

Kedro is a toolbox for production-ready data science. It uses software engineering best practices to help you create data engineering and data science pipelines that are reproducible, maintainable, and modular.

As per kedro-org/kedro-devrel#94, these "data engineering and data science pipelines" actually reflect the broad categories that people have in mind when talking about "pipelines", which are

Data pipelines: ETL/ELT (data ingestion, with or without transformation, from the source to a centralized location, for example a data warehouse)
Machine learning pipelines: the "ML code" part of the classic MLOps architecture

The focus of this issue is on data pipelines.

Data pipelines

Data pipelines are important because they are the beginning of any data project: you need to get your data from somewhere, to then start doing analysis, machine learning, and the like.

Data pipelines are tricky. For ETL architectures, the Transformation needs to be executed carefully, and it's coupled to both the source (Extraction) and target (Loading). ELT is touted as the "modern" approach, but creates a big overhead of often denormalised tables on the data warehouse.

According to industry surveys (kedro-org/kedro-devrel#94), most teams use in-house tools, or no recognizable tools at all (a mess of Python scripts, Jupyter notebooks, and the like), which suggests that most teams are doing ETL as opposed to ELT. The most recognizable tools and vendors focus on ELT and are commercial (Fivetran, Azure Data Factory) whereas the existing open source tools have mixed reviews (Airbyte, Meltano).

Kedro for data pipelines

We have evidence of users using Kedro for authoring data pipelines https://linen-slack.kedro.org/t/16312377/hi-everyone-here-luca-ds-from-italy-happy-kedro-user-for-3-y#2d666fee-5385-45d2-b2f8-4282ef22c2f9

However, there are also signs that Kedro has room for improvement as a tool for creating data pipelines:

Some Kedro projects give up using Kedro for data ingestion and use Bash scripts instead https://github.com/deepyaman/inauditus/blob/develop/refresh-data
There is not a blessed way to extract data from unusual sources (see https://github.com/astrojuanlu/kedro-kaggle-dataset/ for my own attempt at writing a "Kaggle Dataset")
Some unstated Kedro principles seem to put a strong emphasis on reproducibility, whereas the key property for data pipelines is idempotency (see "How to use IncrementalDataset with non file-based datasets?", kedro-plugins#471 (comment))
It's unclear how to write Kedro datasets that are amenable to UPSERT (aka MERGE aka "INSERT or UPDATE") operations (see "[KED-2891] Implement spark.DeltaTable dataset", #964)
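
For readers unfamiliar with the operation, here is a minimal sketch of an idempotent UPSERT using SQLite from the Python standard library. The table, the column names and the `upsert_customers` helper are made up for illustration, nothing here is Kedro API, and the `ON CONFLICT ... DO UPDATE` clause assumes SQLite 3.24 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")


def upsert_customers(conn, rows):
    """Insert new rows; update the name of rows whose id already exists."""
    conn.executemany(
        "INSERT INTO customers (id, name) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
        rows,
    )
    conn.commit()


upsert_customers(conn, [(1, "Ada"), (2, "Grace")])
upsert_customers(conn, [(2, "Grace Hopper"), (3, "Edsger")])
# Replaying the same batch leaves the table unchanged (idempotency).
upsert_customers(conn, [(2, "Grace Hopper"), (3, "Edsger")])

rows = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
```

Running a batch once or several times leaves the table in the same state, which is the property the list above contrasts with reproducibility.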

The fact that Kedro is not mentioned in any industry survey we have found (kedro-org/kedro-devrel#94) is probably a symptom, rather than a cause, of all the above.

There are two sides to this problem:

There might be some friction because of technical difficulties, and/or
There might be a lack of educational material or guidance on how to use Kedro for these tasks.

Next steps

Part of this intersects with #1778, #1936 cc @merelcht

From a product perspective it's worth asking whether we want to pursue making Kedro a suitable tool for ETL/ELT pipelines at all. Regarding ELT, Kedro will probably never be as convenient as the Singer ecosystem and derivatives could theoretically be - however, the practical application of Meltano and Airbyte leaves some gaps, and maybe Kedro could be a satisfying tool for some users. Regarding ETL, I think Kedro could be a perfect framework for this, provided that the datasets, the crucial bits that perform the I/O, are up to the task or at least we provide clear guidance of what is the "Kedronic" way of achieving idempotent data ingestion pipelines that can support cold starts, behave well under changes of the source schema, and any other desirable properties of data pipelines.

From a technical perspective, on the other hand, we need to develop an understanding of how Kedro can be used nowadays for ELT/ETL pipelines following modern data architecture patterns, and evaluate to what extent the pains described above are real or are just a matter of having better docs.

On a related note, discussion in kedro-org/kedro-plugins#471 surfaced that we might have to make some of the Kedro principles more explicit.

Finally, we should execute on messaging/value proposition updates based on the conclusions of our investigation, and probably generate appropriate educational material in the form of documentation, blog posts, and videos.

datajoely · 2024-01-30T18:19:28Z

Ibis has to be central to this

astrojuanlu · 2024-02-04T09:38:28Z

To give a specific example of how this is posing a problem to users: https://linen-slack.kedro.org/t/16366189/tldr-is-there-a-suggested-pattern-for-converting-vanilla-par#23c36a9d-7bea-40f9-a21f-cc6def7e9ccf

A user tries to convert a Parquet file to a Delta table with a Kedro pipeline, only to get DatasetError: DeltaTableDataset is a read only dataset type. Supposedly there's a rationale for this in the original PR from 3 years ago (#964), but (1) the conversation is extremely long, and I can't pinpoint the exact moment it was decided to remove the _save() functionality, and (2) this was never documented anywhere, so users are left in the dark.

Going through the PR again, I found a comment that spells out the problem in detail: #964 (comment)

Update, Upsert/Merge, Delete

These are not directly consistent with the Kedro Principles & DAG, as

The filepath is intrinsic to the DeltaTable

The update, merge and delete methods are methods on the DeltaTable and are immediately materialised (on call or on subsequent execute call on a merge builder)

We still need to inform the Kedro pipeline and DAG that this node has succeeded in a meaningful way

This is the problem we're addressing.

deepyaman · 2024-02-05T15:12:57Z

It's unclear how to write Kedro datasets that are amenable to UPSERT (aka MERGE aka "INSERT or UPDATE") operations #964

Upsert is mostly supported by database backends. You could simulate it in data frames using concat with indices, e.g. in pandas or spark, but it's not very clean.
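
A minimal sketch of the data frame approach in pandas, assuming the upsert key is the index. The `upsert` helper is hypothetical, not a pandas or Kedro function.

```python
import pandas as pd


def upsert(existing: pd.DataFrame, incoming: pd.DataFrame) -> pd.DataFrame:
    """Return `existing` with rows from `incoming` inserted or updated by index."""
    combined = pd.concat([existing, incoming])
    # Keep the last occurrence of each index label so incoming rows win.
    deduplicated = combined[~combined.index.duplicated(keep="last")]
    return deduplicated.sort_index()


existing = pd.DataFrame({"value": [10, 20]}, index=pd.Index([1, 2], name="id"))
incoming = pd.DataFrame({"value": [25, 30]}, index=pd.Index([2, 3], name="id"))

result = upsert(existing, incoming)
# Applying the same batch again does not change the result.
replayed = upsert(result, incoming)
```

One drawback of this approach is that the whole existing table has to be loaded into memory before it can be merged.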

For database backends, it is on the radar for Ibis support.

Regarding ELT, Kedro will probably never be as convenient as the Singer ecosystem and derivatives could theoretically be - however, the practical application of Meltano and Airbyte leaves some gaps, and maybe Kedro could be a satisfying tool for some users. Regarding ETL, I think Kedro could be a perfect framework for this

Why can Kedro not be at least as good at ELT as it is at ETL? As long as you can interact with databases natively using SQL under the hood, I think it can be a great option for the people who are going to use Python anyway (or prefer to).

Also, I'm not recent enough on this perhaps, but has there been any momentum back towards people wanting to do ETL? If ELT is still where "modern data engineering" is at, doing ETL well isn't that exciting.

Finally, we should execute on messaging/value proposition updates based on the conclusions of our investigation, and probably generate appropriate educational material in the form of documentation, blog posts, and videos.

💯

If Kedro is a tool that supports both data pipelines and ML pipelines, it makes sense that people are educated on how to write each, and don't use the same approach for the disparate problems.

astrojuanlu · 2024-02-05T15:53:48Z

Also, I'm not recent enough on this perhaps, but has there been any momentum back towards people wanting to do ETL? If ELT is still where "modern data engineering" is at, doing ETL well isn't that exciting.

We could discuss whether the Modern Data Stack was a real industry trend or only happened on Data Twitter - but I'll only do so over beer 😄

astrojuanlu · 2024-02-07T10:55:27Z

Why can Kedro not be at least as good at ELT as it is at ETL?

I'm not denying this. What I'm saying is that, in theory¹,

$ pip install meltano
$ meltano add extractor tap-postgres
$ meltano add loader target-snowflake
$ meltano run

is the optimal open-source, CLI-based, EL experience, and I don't think Kedro can match this at the moment or in the near future (very happy to be proven wrong).

Edit: Meltano would be EL, then for example dbt would be T, or as Lauren Balik jokingly says, TTTTTTT

¹ In theory there is no difference between theory and practice, while in practice there is.

inigohidalgo · 2024-02-28T14:06:56Z

We've implemented an in-house upsert functionality into one of our Arrow datasets using a method @deepyaman alludes to

You could simulate it in data frames using concat with indices, e.g. in pandas or spark, but it's not very clean.

The write_mode is just a save_arg for us. This definitely breaks "reproducibility" though, and moves towards idempotency, as @astrojuanlu pointed out.
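
To make the idea concrete, here is a rough sketch of a dataset whose save behaviour is switched by a `write_mode` entry in `save_args`. It is not the in-house implementation mentioned above: `KeyedCSVDataset` is a hypothetical, standard-library-only stand-in, and a real Kedro dataset would presumably subclass `kedro.io.AbstractDataset` instead of being a plain class.

```python
import csv
import tempfile
from pathlib import Path


class KeyedCSVDataset:
    """Hypothetical stand-in for a Kedro dataset; rows are dicts keyed by `key`."""

    def __init__(self, filepath, key, save_args=None):
        self._filepath = Path(filepath)
        self._key = key
        # "overwrite" replaces the file, "upsert" merges new rows by key.
        self._write_mode = (save_args or {}).get("write_mode", "overwrite")
        if self._write_mode not in ("overwrite", "upsert"):
            raise ValueError(f"Unknown write_mode: {self._write_mode}")

    def load(self):
        if not self._filepath.exists():
            return []  # cold start: nothing has been written yet
        with self._filepath.open(newline="") as f:
            return list(csv.DictReader(f))

    def save(self, rows):
        if self._write_mode == "upsert":
            merged = {str(r[self._key]): r for r in self.load()}
            merged.update({str(r[self._key]): r for r in rows})
            rows = list(merged.values())
        fieldnames = list(rows[0]) if rows else [self._key]
        with self._filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)


with tempfile.TemporaryDirectory() as tmp:
    dataset = KeyedCSVDataset(
        Path(tmp) / "orders.csv", key="id", save_args={"write_mode": "upsert"}
    )
    dataset.save([{"id": "1", "status": "new"}, {"id": "2", "status": "new"}])
    dataset.save([{"id": "2", "status": "shipped"}, {"id": "3", "status": "new"}])
    # Saving the same batch again does not change the stored rows.
    dataset.save([{"id": "2", "status": "shipped"}, {"id": "3", "status": "new"}])
    final_rows = dataset.load()
```

Note that the output of `save` now depends on what was already stored, which is exactly where reproducibility is traded for idempotency.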

inigohidalgo · 2024-02-28T15:29:58Z

Why can Kedro not be at least as good at ELT as it is at ETL?

I'm not denying this. What I'm saying that, in theory
$ pip install meltano
$ meltano add extractor tap-postgres
$ meltano add loader target-snowflake
$ meltano run
is the optimal open-source, CLI-based, ELT experience, and I don't think Kedro can match this at the moment or in the near future

I've never used meltano, but this covers only EL in ELT, right? Kedro+ibis could slot in very nicely into the T, and also provide more-than-good-enough performance for the EL side, though it does seem hard to beat specialized tools like meltano.

astrojuanlu · 2024-02-28T15:32:40Z

Oh, correct. I meant "the optimal [...] EL experience".

takikadiri · 2024-03-30T15:17:52Z

I would love to see Kedro fully support the "T", standing as an alternative to dbt for engine-based transformations but with a Python API. This could bring huge value to data teams that need to juggle two (or more) different technologies/frameworks and throw their work over the wall to other teams, depending on the stage of their data pipelines (DE, DS/ML).

This could significantly enlarge Kedro's user base, as there is a much larger volume of work in data and analytics engineering than in data science and ML.

As for the "E" and "L" parts, Kedro could be just good enough.

datajoely · 2024-04-02T08:16:57Z

To achieve this I really believe we should go all in on Ibis as a first-class citizen / preferred approach in Kedro. One syntax for, broadly, the backends we care about, enabling the interdisciplinary collaboration @takikadiri mentions.

astrojuanlu · 2024-04-02T08:32:16Z

(From phone) To clarify, I don't think T is the problem, but rather E & L. I suspect some changes in philosophy or even API might be required that go beyond adopting Ibis; the task here is to investigate.

astrojuanlu · 2024-04-02T08:33:53Z

Although T might also require some improvements in how we approach upserts.

astrojuanlu · 2024-06-15T14:15:26Z

At PyData London I spoke to 2 different users about how they were using Kedro for their ETL pipelines and they both have challenges:

One of them has a special arrangement and performs the I/O outside of Kedro because of the current limitations, so they don't really leverage the full power of Kedro, just use it as a micro-orchestration engine.
Another one had to write their own Delta Table dataset with upserts (update credentials example for S3 bucket specs #542), plus a layer of state management for checkpointing and keeping track of what parts of the data had been ingested already.
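
As an illustration of what such a state-management layer might look like (this is not that user's code), here is a minimal high-watermark checkpoint sketch. It assumes source rows carry a monotonically increasing `id`; the helper names and the JSON checkpoint file are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_checkpoint(path, default=0):
    """Return the last ingested id, or `default` on a cold start."""
    path = Path(path)
    if not path.exists():
        return default
    return json.loads(path.read_text())["last_id"]


def write_checkpoint(path, last_id):
    Path(path).write_text(json.dumps({"last_id": last_id}))


def ingest_new(source, sink, checkpoint_path):
    """Append rows newer than the checkpoint to `sink`; return how many were added."""
    last_id = read_checkpoint(checkpoint_path)
    new_rows = [row for row in source if row["id"] > last_id]
    sink.extend(new_rows)
    if new_rows:
        write_checkpoint(checkpoint_path, max(row["id"] for row in new_rows))
    return len(new_rows)


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"
    source = [{"id": 1}, {"id": 2}, {"id": 3}]
    sink = []
    first = ingest_new(source, sink, checkpoint)   # cold start, ingests everything
    second = ingest_new(source, sink, checkpoint)  # replay, nothing new
    source.append({"id": 4})
    third = ingest_new(source, sink, checkpoint)   # only the new row
```

The checkpoint is state that lives outside the pipeline definition, which is the kind of thing Kedro currently leaves to the user.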

Also, while discussing this in person with @deepyaman, I realised that both EL and T data pipelines need upserts anyway, so probably my comments above were somewhat misguided.

astrojuanlu · 2024-06-15T14:17:39Z

Inspiration: "Incremental loads should be replayable by design" (source)

datajoely · 2024-06-17T09:26:06Z

I wonder if we could make start_date and end_date first class CLI arguments and part of the session constructor?

astrojuanlu · 2024-06-17T09:45:57Z

Sometimes it would be date, sometimes it would be id... Don't think we can anticipate all possible pagination options. But regardless, I think this is more or less achievable already thanks to runtime parameters, right? The difficult thing is the upsert logic. Not from a technical perspective but from a product philosophy perspective, shifting from a focus on reproducibility to a focus on idempotency (data pipelines and machine learning pipelines might require different approaches)
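
A small sketch of what "replayable" could mean in practice, assuming the window boundaries arrive as parameters (for example via runtime parameters passed to `kedro run --params`). The function names, the `params` dict and the row layout are hypothetical; the point is that loading a window replaces that window in the target instead of appending to it.

```python
from datetime import date


def extract_window(source, start, end):
    """Pure function: rows whose day falls in the half-open window [start, end)."""
    return [row for row in source if start <= row["day"] < end]


def load_window(target, rows, start, end):
    """Replace the window [start, end) in `target` with `rows` (delete, then insert)."""
    outside = [row for row in target if not (start <= row["day"] < end)]
    return sorted(outside + rows, key=lambda row: (row["day"], row["id"]))


source = [
    {"id": 1, "day": date(2024, 6, 1), "amount": 10},
    {"id": 2, "day": date(2024, 6, 2), "amount": 20},
    {"id": 3, "day": date(2024, 6, 3), "amount": 30},
]

# Hypothetical runtime parameters selecting the window to (re)load.
params = {"start": date(2024, 6, 2), "end": date(2024, 6, 4)}

target = [{"id": 1, "day": date(2024, 6, 1), "amount": 10}]
batch = extract_window(source, params["start"], params["end"])
once = load_window(target, batch, params["start"], params["end"])
# Replaying the same window yields the same target.
twice = load_window(once, batch, params["start"], params["end"])
```

The same shape works if the window is expressed in ids rather than dates, since only the comparison on the window key changes.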

datajoely · 2024-06-17T10:24:24Z

I think you can generalise upserts into the need for a conditional node...

astrojuanlu · 2024-06-17T10:48:29Z

BTW about Ibis and upserts ibis-project/ibis#5391 cc @deepyaman

astrojuanlu · 2024-07-14T21:37:56Z

Kept thinking and thinking about the reproducibility idea. From #3979 (comment) (by @datajoely) and some conversations I had during EuroPython (and the fact that I've been mulling over this almost since I joined the project):

We should acknowledge that the reproducibility principle has never been explicit. In fact, it's mentioned zero times in https://github.com/kedro-org/kedro/wiki/Kedro-Principles (established roughly 3 years ago #824).

It was said in kedro-org/kedro-plugins#471 (comment) that "pure functions + DAG + static catalog = reproducible", but as I already hinted in that thread, that only holds if you assume that the catalog points to files that are tracked by version control alongside the code itself. The moment you refer to a remote location s3://bucket/my-file.csv that principle already breaks, because Kedro doesn't do any sort of hashing of the inputs and, by definition, whatever comes from that URL is out of Kedro jurisdiction.
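
For illustration, this is roughly what hashing an input would involve: record a content fingerprint once, then compare on later runs. This is a standard-library sketch, not something Kedro does; `fingerprint` and `has_changed` are hypothetical helpers, and a local temporary file stands in for the remote object.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path, recorded):
    """True if the file content no longer matches the recorded fingerprint."""
    return fingerprint(path) != recorded


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "my-file.csv"
    data_file.write_text("id,value\n1,10\n")
    recorded = fingerprint(data_file)
    changed_before = has_changed(data_file, recorded)
    # Simulate the upstream source silently changing between runs.
    data_file.write_text("id,value\n1,99\n")
    changed_after = has_changed(data_file, recorded)
```

For a genuinely remote object, the same idea would need either the downloaded bytes or a checksum provided by the storage service, which is outside the scope of this sketch.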

LLMs aren't hugely different from any other REST APIs in that regard. Even under my "1. Frozen inputs" scenario there, models and APIs aren't versioned, there's randomness without the possibility to set the seed, etc.

Hence remote data locations, database connection strings, REST APIs, LLMs... all of them can break reproducibility.

We should of course continue to try to keep functions as pure as possible (always a good thing to do) and push the I/O part to the datasets, but users are demanding a better answer to data pipelines and dynamic catalogs, so I think it's time to break free from the illusion that Kedro in itself and by itself can guarantee the reproducibility of the pipelines.

datajoely · 2024-07-15T08:10:27Z

Hence remote data locations, database connection strings, REST APIs, LLMs... all of them can break reproducibility.

This is true of any upstream data outside of your control, a SQL table/view which is changing frequently also applies. In my mind all of these roads lead back to some sort of conditional construct... you need it to do UPSERTS and a bunch of other things.

astrojuanlu · 2024-07-30T15:47:22Z

FYI, we're considering giving dlt a try to see how it works alongside Kedro.

astrojuanlu · 2024-07-30T15:48:14Z

Also renamed this issue to hopefully make it less confusing.

astrojuanlu · 2024-10-22T11:32:23Z

xref https://kedro.hall.community/risk-of-loading-full-dataset-instead-of-incremental-updates-JQn11wCgFLpT

astrojuanlu mentioned this issue Jan 30, 2024

How to use IncrementalDataset with non file-based datasets? kedro-org/kedro-plugins#471

Open

github-actions bot mentioned this issue Feb 1, 2024

Monthly issue metrics report #3582

Open

astrojuanlu mentioned this issue Feb 7, 2024

[spike] Clarify status of various Delta Table datasets kedro-org/kedro-plugins#542

Open

inigohidalgo mentioned this issue Mar 11, 2024

CachedDataset example usage #3616

Open

astrojuanlu changed the title ~~[spike] Investigate suitability of Kedro for ETL/ELT data pipelines~~ [spike] Investigate suitability of Kedro for EL pipelines and incremental data loading Jul 30, 2024

deepyaman mentioned this issue Aug 2, 2024

Define best practice for integrating Kedro and dlt #4057

Open

astrojuanlu mentioned this issue Oct 22, 2024

Downloads table is borked kedro-org/kedro-devrel#154

Closed
