
Modular Pipelines #220

Closed
yetudada opened this issue Feb 14, 2020 · 9 comments

@yetudada
Contributor

yetudada commented Feb 14, 2020

Description

We've seen something incredible evolve through continued use of Kedro. Teams around the world are starting to use Kedro to create stores of reusable pipelines.

Last year, we introduced basic support for Modular Pipelines, and this year we're doubling down on this area.

In our world, a modular pipeline is a series of generalised and connected Python functions that have inputs and outputs. A modular pipeline:

  • Can be easily added to an existing or new Kedro project
  • Has virtually no learning curve if you already know how to use Kedro
  • Can be tested by itself, to ensure high-quality code
  • Does not have a Kedro version dependency (related to Framework Redesign #219)
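
For illustration, a minimal modular pipeline might look like the sketch below (the kedro.pipeline API is real; the function and dataset names are hypothetical):

```python
from kedro.pipeline import Pipeline, node

def clean(raw_df):
    # Plain Python: the function knows nothing about Kedro.
    return raw_df.dropna()

def featurize(clean_df):
    # Derive a simple feature column.
    return clean_df.assign(total=clean_df.sum(axis=1))

# Only the dataset names below touch the outside world, which is what
# makes the pipeline easy to drop into any project and test in isolation.
data_engineering = Pipeline([
    node(clean, inputs="raw_data", outputs="clean_data", name="clean"),
    node(featurize, inputs="clean_data", outputs="features", name="featurize"),
])
```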

Context

The final evolution of Modular Pipelines will see an ecosystem of reusable pipelines. However, for now we want to focus on allowing users to easily add pre-assembled pipelines to an existing or new Kedro project and export their own pre-assembled pipelines.

Next steps

Give us feedback if you've tried Modular Pipelines and the basic support we have for them, like pipeline.transform(). Modular Pipelines also have implications for kedro-viz, and we can't wait to show you what we have in mind for this.

@yetudada added the Issue: Feature Request and Type: Opportunity labels Feb 14, 2020
@yetudada removed the Issue: Feature Request label Feb 14, 2020
@EigenJT

EigenJT commented Apr 15, 2020

@yetudada Not sure if this is where you'd like the feedback, but this is essentially how we've been building all our pipelines. One of the sticking points I've found is how to write tests that ensure the pipelines work within a kedro context. What I've resorted to doing is writing tests that create temporary kedro projects, then test the pipelines within them.
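
Roughly, the pattern looks like this (the template-project path and file names are just placeholders for our setup):

```python
import shutil
import subprocess
import sys
from pathlib import Path

def test_pipeline_in_temp_project(tmp_path):
    # Copy a checked-in template project into a temp dir and seed fake data.
    project = tmp_path / "proj"
    shutil.copytree(Path(__file__).parent / "template_project", project)
    (project / "data" / "01_raw" / "input.csv").write_text("a,b\n1,2\n")

    # Run the pipeline exactly as `kedro run` would; tear-down is automatic
    # because everything lives under pytest's tmp_path.
    result = subprocess.run(
        [sys.executable, "-m", "kedro", "run"],
        cwd=project, capture_output=True, text=True,
    )
    assert result.returncode == 0, result.stderr
    assert (project / "data" / "02_intermediate" / "output.csv").exists()
```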

@lorenabalan
Contributor

Hi @EigenJT, thank you so much for sharing your feedback! Could you elaborate a bit on what you mean by "ensure the pipelines work within a kedro context"?

Is that end-to-end, from loading fake data to the context identifying the right pipeline and running the nodes with the right inputs/outputs? Your tests sound like integration tests rather than unit tests to me; correct me if I'm wrong. Are you testing that different variations of a pipeline (with pipeline.transform()) are run successfully and as expected?

@EigenJT

EigenJT commented Apr 15, 2020

Hi @lorenabalan, yup integration tests is the best description. We write unit tests for the nodes, but to ensure the pipeline actually works as intended, we create fake data, fake parameters and a fake catalog, try the pipeline, evaluate the results, then tear the whole thing down. Essentially testing the results of a kedro run command.
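
In code, the setup is roughly this (dataset names and the filtering node are placeholder examples; DataCatalog, MemoryDataSet and SequentialRunner are the standard Kedro pieces we lean on):

```python
import pandas as pd
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

def drop_below(df, threshold):
    clean = df.dropna()
    return clean[clean["a"] > threshold]

def test_pipeline_end_to_end():
    # Fake data and fake parameters live in an in-memory catalog,
    # so nothing touches disk and tear-down is free.
    catalog = DataCatalog({
        "raw_data": MemoryDataSet(pd.DataFrame({"a": [1.0, None, 3.0]})),
        "params:threshold": MemoryDataSet(2.0),
    })
    pipeline = Pipeline([
        node(drop_below, ["raw_data", "params:threshold"], "filtered"),
    ])
    # Outputs not registered in the catalog are returned by the runner.
    result = SequentialRunner().run(pipeline, catalog)
    assert result["filtered"]["a"].tolist() == [3.0]
```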

We haven't tried pipeline.transform(); to be honest, we didn't know it existed. I think we've done something similar by creating pipeline classes that can be initialized with specific inputs and outputs.

@lorenabalan
Contributor

lorenabalan commented Apr 16, 2020

My bad, when you said "this is essentially how we've been building all our pipelines" I thought you meant with pipeline.transform() mentioned in the issue.
The problem of end-to-end testing has come up before in discussions, but it would be great to probe a bit deeper: does this end-to-end testing extend beyond what schema validation and lots of granular unit tests on individual nodes (with mock data) would cover? If so, what additional behaviour are you interested in having covered?
It's an interesting problem, the more use cases we know of the better we can help. :)

Edit: Also maybe worth taking a peek at the docs in develop, I suspect something like this could make your tests easier to reason about: https://kedro.readthedocs.io/en/latest/04_user_guide/06_pipelines.html#using-a-custom-runner

@EigenJT

EigenJT commented Apr 16, 2020

> Edit: Also maybe worth taking a peek at the docs in develop, I suspect something like this could make your tests easier to reason about: https://kedro.readthedocs.io/en/latest/04_user_guide/06_pipelines.html#using-a-custom-runner

@lorenabalan Ah yeah that would make things much easier.

Regarding the additional end-to-end testing: it's about ensuring that the pipeline actually does what it's supposed to do. Given fake inputs, validate that the resulting outputs are exactly what they should be; a data test, in short. As an example, a pipeline dedicated to reformatting a certain filetype would have its final output tested against a known output (down to the individual values, as well as the schema).
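
Something like this, in other words (run_pipeline here stands in for our own helper fixture, not a Kedro API):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_reformat_matches_known_output(run_pipeline):
    # `run_pipeline` executes the modular pipeline on fixture inputs
    # and hands back the final DataFrame.
    actual = run_pipeline("reformat_pipeline")
    expected = pd.read_csv("tests/fixtures/expected_output.csv")
    # assert_frame_equal checks values and schema: column names, order, dtypes.
    assert_frame_equal(actual, expected)
```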

As an aside, one issue I've run into is that when pipelines are written in isolation but then added together, the resulting pipeline can behave in an unexpected manner (the order in which nodes are run can change, for example). Not so much a test, but it would be interesting to have some way of requiring each modular pipeline to complete before another is kicked off. Maybe something like

total_pipeline = combine_pipelines([pipeline_1, pipeline_2, pipeline_3, ...], enforce_order=True)

total_pipeline would have to execute every node in pipeline_1 before starting on pipeline_2.

@lorenabalan
Contributor

You're right that order is not necessarily guaranteed, though that should only be at tie level (nodes with the same number of dependencies), as we leave it to toposort to figure out the execution order of the DAG. If two "parallel" nodes running in a different order has a significant impact, maybe there's a missing link there?
Provided you give the right dependencies (node inputs and outputs), you can enforce the order that way, e.g. by creating an explicit link between the end of pipeline_1 and the beginning of pipeline_2. One of the main reasons we've stuck with this approach is that it's much easier to understand and follow "what runs first", for instance when you look at Viz, rather than holding that information in your head. Basically explicit over implicit.
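
For example (names are hypothetical), a tiny "handover" dataset is enough to force the ordering:

```python
from kedro.pipeline import Pipeline, node

def make_report(df):
    return {"rows": len(df)}

def begin_stage_two(report, raw):
    # `report` is consumed only to create the dependency edge, so
    # toposort must finish pipeline_1 before this node can run.
    return raw

pipeline_1 = Pipeline([node(make_report, "clean_data", "report_1")])
pipeline_2 = Pipeline([
    node(begin_stage_two, ["report_1", "other_data"], "stage_two_input"),
])
combined = pipeline_1 + pipeline_2
```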
Let me know if any of this doesn't make sense.

@lorenabalan
Contributor

lorenabalan commented Apr 21, 2020

Also, as an update (for whoever is interested), we're looking to include this feature in the next breaking release (0.16.0). We've merged the pipeline() helper here, a slightly cleaner alternative to Pipeline.transform() (which we're dropping), for mapping input/output/parameter names or namespacing (prefixing) datasets and node names.
There's also work being done on the CLI side to help with the workflow of creating and working with modular pipelines. This includes generating a new pipeline, packaging an existing pipeline, and pulling an existing pipeline from somewhere and integrating it into a Kedro project.
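
A quick sketch of the helper (dataset names are made up; treat the exact signature as indicative until 0.16.0 is out):

```python
from kedro.pipeline import Pipeline, node, pipeline

def scale(df):
    return df / df.max()

template = Pipeline([node(scale, "input_df", "scaled_df")])

# Reuse the template twice: remap the boundary datasets and namespace
# everything else so the two instances can't collide.
train = pipeline(template, inputs={"input_df": "train_raw"},
                 outputs={"scaled_df": "train_scaled"}, namespace="train")
test = pipeline(template, inputs={"input_df": "test_raw"},
                outputs={"scaled_df": "test_scaled"}, namespace="test")
combined = train + test
```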

@EigenJT

EigenJT commented Apr 21, 2020

> You're right that order is not necessarily guaranteed, though that should only be at tie level (nodes with the same number of dependencies), as we leave it to toposort to figure out the execution order of the DAG. If two "parallel" nodes running in a different order has a significant impact, maybe there's a missing link there?
> Provided you give the right dependencies (node inputs and outputs), you can enforce the order that way, e.g. by creating an explicit link between the end of pipeline_1 and the beginning of pipeline_2. One of the main reasons we've stuck with this approach is that it's much easier to understand and follow "what runs first", for instance when you look at Viz, rather than holding that information in your head. Basically explicit over implicit.
> Let me know if any of this doesn't make sense.

All makes sense, and we've been making those explicit links between pipelines on our end to ensure that things run as expected. The modular_pipeline looks pretty cool!

@yetudada
Contributor Author

yetudada commented Mar 8, 2021

Hi @EigenJT! We hope you've been able to make use of the new modular pipeline workflow. We're going to close this issue as part of our GitHub issue clean-up, but please do comment to re-open this issue or create a new one based on your requirements.

@yetudada closed this as completed Mar 8, 2021