
Modular Pipelines #220

Closed
yetudada opened this issue Feb 14, 2020 · 9 comments

@yetudada
Contributor

yetudada commented Feb 14, 2020

Description

We've seen something incredible evolve through continued use of Kedro. Teams around the world are starting to use Kedro to create stores of reusable pipelines.

Last year, we introduced basic support for Modular Pipelines, and this year we're doubling down on this area.

In our world, a modular pipeline is a series of generalised and connected Python functions that have inputs and outputs. A modular pipeline:

  • Can be easily added to an existing or new Kedro project
  • Has virtually no learning curve if you already know how to use Kedro
  • Can be tested by itself, to ensure high-quality code
  • Does not have a Kedro version dependency (related to Framework Redesign #219)
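
For illustration, a minimal modular pipeline might look like the sketch below (the kedro.pipeline API is real; the function and dataset names are hypothetical):

```python
from kedro.pipeline import Pipeline, node

def clean(raw_df):
    # Plain Python: the function knows nothing about Kedro.
    return raw_df.dropna()

def featurize(clean_df):
    # Derive a simple feature column.
    return clean_df.assign(total=clean_df.sum(axis=1))

# Only the dataset names below touch the outside world, which is what
# makes the pipeline easy to drop into any project and test in isolation.
data_engineering = Pipeline([
    node(clean, inputs="raw_data", outputs="clean_data", name="clean"),
    node(featurize, inputs="clean_data", outputs="features", name="featurize"),
])
```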

Context

The final evolution of Modular Pipelines will see an ecosystem of reusable pipelines. However, for now we want to focus on allowing users to easily add pre-assembled pipelines to an existing or new Kedro project and export their own pre-assembled pipelines.

Next steps

Give us feedback if you've tried Modular Pipelines and the basic support we have for them, like pipeline.transform(). Modular Pipelines also have implications for kedro-viz, and we can't wait to show you what we have in mind for this.

@yetudada added the Issue: Feature Request and Type: Opportunity labels Feb 14, 2020
@yetudada removed the Issue: Feature Request label Feb 14, 2020
@EigenJT

EigenJT commented Apr 15, 2020

@yetudada Not sure if this is where you'd like the feedback, but this is essentially how we've been building all our pipelines. One of the sticking points I've found is how to write tests that ensure the pipelines work within a kedro context. What I've resorted to doing is writing tests that create temporary kedro projects, then test the pipelines within them.
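
Roughly, the pattern looks like this (the template-project path and file names are just placeholders for our setup):

```python
import shutil
import subprocess
import sys
from pathlib import Path

def test_pipeline_in_temp_project(tmp_path):
    # Copy a checked-in template project into a temp dir and seed fake data.
    project = tmp_path / "proj"
    shutil.copytree(Path(__file__).parent / "template_project", project)
    (project / "data" / "01_raw" / "input.csv").write_text("a,b\n1,2\n")

    # Run the pipeline exactly as `kedro run` would; tear-down is automatic
    # because everything lives under pytest's tmp_path.
    result = subprocess.run(
        [sys.executable, "-m", "kedro", "run"],
        cwd=project, capture_output=True, text=True,
    )
    assert result.returncode == 0, result.stderr
    assert (project / "data" / "02_intermediate" / "output.csv").exists()
```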

@lorenabalan
Contributor

Hi @EigenJT, thank you so much for sharing your feedback! Could you elaborate a bit on what you mean by "ensure the pipelines work within a kedro context"?

Is that end-to-end, from loading fake data to the context identifying the right pipeline and running the nodes with the right inputs/outputs? Your tests sound like integration tests rather than unit tests to me; correct me if I'm wrong. Are you testing that different variations of a pipeline (with pipeline.transform()) are run successfully and as expected?

@EigenJT

EigenJT commented Apr 15, 2020

Hi @lorenabalan, yup integration tests is the best description. We write unit tests for the nodes, but to ensure the pipeline actually works as intended, we create fake data, fake parameters and a fake catalog, try the pipeline, evaluate the results, then tear the whole thing down. Essentially testing the results of a kedro run command.
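
In code, the setup is roughly this (dataset names and the filtering node are placeholder examples; DataCatalog, MemoryDataSet and SequentialRunner are the standard Kedro pieces we lean on):

```python
import pandas as pd
from kedro.io import DataCatalog, MemoryDataSet
from kedro.pipeline import Pipeline, node
from kedro.runner import SequentialRunner

def drop_below(df, threshold):
    clean = df.dropna()
    return clean[clean["a"] > threshold]

def test_pipeline_end_to_end():
    # Fake data and fake parameters live in an in-memory catalog,
    # so nothing touches disk and tear-down is free.
    catalog = DataCatalog({
        "raw_data": MemoryDataSet(pd.DataFrame({"a": [1.0, None, 3.0]})),
        "params:threshold": MemoryDataSet(2.0),
    })
    pipeline = Pipeline([
        node(drop_below, ["raw_data", "params:threshold"], "filtered"),
    ])
    # Outputs not registered in the catalog are returned by the runner.
    result = SequentialRunner().run(pipeline, catalog)
    assert result["filtered"]["a"].tolist() == [3.0]
```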

We haven't tried pipeline.transform(); to be honest, we didn't know it existed. I think we've done something similar by creating pipeline classes that can be initialized with specific inputs and outputs.

@lorenabalan
Contributor

lorenabalan commented Apr 16, 2020

My bad, when you said "this is essentially how we've been building all our pipelines" I thought you meant with pipeline.transform() mentioned in the issue.
The problem of end-to-end testing has come up before in discussions, but it would be great to probe a bit deeper: does this end-to-end testing extend beyond what schema validation and lots of granular unit tests on individual nodes (with mock data) would cover? If so, what additional behaviour are you interested in having covered?
It's an interesting problem, the more use cases we know of the better we can help. :)

Edit: Also maybe worth taking a peek at the docs in develop, I suspect something like this could make your tests easier to reason about: https://kedro.readthedocs.io/en/latest/04_user_guide/06_pipelines.html#using-a-custom-runner

@EigenJT

EigenJT commented Apr 16, 2020

> Edit: Also maybe worth taking a peek at the docs in develop, I suspect something like this could make your tests easier to reason about: https://kedro.readthedocs.io/en/latest/04_user_guide/06_pipelines.html#using-a-custom-runner

@lorenabalan Ah yeah that would make things much easier.

Regarding the additional end-to-end testing: it's about ensuring that the pipeline actually does what it's supposed to do. Given fake inputs, validate that the resulting outputs are exactly what they should be; a data test, in short. As an example, a pipeline dedicated to reformatting a certain filetype would have its final output tested against a known output (down to the individual values, as well as the schema).
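
Something like this, in other words (run_pipeline here stands in for our own helper fixture, not a Kedro API):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

def test_reformat_matches_known_output(run_pipeline):
    # `run_pipeline` executes the modular pipeline on fixture inputs
    # and hands back the final DataFrame.
    actual = run_pipeline("reformat_pipeline")
    expected = pd.read_csv("tests/fixtures/expected_output.csv")
    # assert_frame_equal checks values and schema: column names, order, dtypes.
    assert_frame_equal(actual, expected)
```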

As an aside, one issue I've run into is that when pipelines are written in isolation but then added together, the resulting pipeline can behave in an unexpected manner (the order in which nodes are run can change, for example). Not so much a test, but it would be interesting to have some way of requiring each modular pipeline to complete before another is kicked off. Maybe something like

total_pipeline = combine_pipelines([pipeline_1, pipeline_2, pipeline_3, ...], enforce_order=True)

total_pipeline would have to execute every node in pipeline_1 before starting on pipeline_2.

@lorenabalan
Contributor

You're right that order is not necessarily guaranteed, though that should only be at tie level (nodes with the same number of dependencies), as we leave it to toposort to figure out the execution order of the DAG. If two "parallel" nodes running in a different order has a significant impact, maybe there's a missing link there?
Provided you give the right dependencies (node inputs and outputs), you can enforce the order that way, e.g. by creating an explicit link between the end of pipeline_1 and the beginning of pipeline_2. One of the main reasons we've stuck with this approach is that it's much easier to understand and follow "what runs first", for instance when you look at Viz, rather than holding that information in your head. Basically explicit over implicit.
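
For example (names are hypothetical), a tiny "handover" dataset is enough to force the ordering:

```python
from kedro.pipeline import Pipeline, node

def make_report(df):
    return {"rows": len(df)}

def begin_stage_two(report, raw):
    # `report` is consumed only to create the dependency edge, so
    # toposort must finish pipeline_1 before this node can run.
    return raw

pipeline_1 = Pipeline([node(make_report, "clean_data", "report_1")])
pipeline_2 = Pipeline([
    node(begin_stage_two, ["report_1", "other_data"], "stage_two_input"),
])
combined = pipeline_1 + pipeline_2
```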
Let me know if any of this doesn't make sense.

@lorenabalan
Contributor

lorenabalan commented Apr 21, 2020

Also, as an update (for whoever is interested), we're looking to include this feature in the next breaking release (0.16.0). We've merged the pipeline() helper here, a slightly cleaner alternative to Pipeline.transform() (which we're dropping), for mapping input/output/parameter names or namespacing (prefixing) datasets and node names.
There's also work being done on the CLI side to help with the workflow of creating and working with modular pipelines. This includes generating a new pipeline, packaging an existing pipeline, and pulling an existing pipeline from somewhere and integrating it into a Kedro project.
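
A quick sketch of the helper (dataset names are made up; treat the exact signature as indicative until 0.16.0 is out):

```python
from kedro.pipeline import Pipeline, node, pipeline

def scale(df):
    return df / df.max()

template = Pipeline([node(scale, "input_df", "scaled_df")])

# Reuse the template twice: remap the boundary datasets and namespace
# everything else so the two instances can't collide.
train = pipeline(template, inputs={"input_df": "train_raw"},
                 outputs={"scaled_df": "train_scaled"}, namespace="train")
test = pipeline(template, inputs={"input_df": "test_raw"},
                outputs={"scaled_df": "test_scaled"}, namespace="test")
combined = train + test
```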

@EigenJT

EigenJT commented Apr 21, 2020

> You're right that order is not necessarily guaranteed, though that should only be at tie level (nodes with the same number of dependencies), as we leave it to toposort to figure out the execution order of the DAG. If two "parallel" nodes running in a different order has a significant impact, maybe there's a missing link there?
> Provided you give the right dependencies (node inputs and outputs), you can enforce the order that way, e.g. by creating an explicit link between the end of pipeline_1 and the beginning of pipeline_2. One of the main reasons we've stuck with this approach is that it's much easier to understand and follow "what runs first", for instance when you look at Viz, rather than holding that information in your head. Basically explicit over implicit.
> Let me know if any of this doesn't make sense.

All makes sense, and we've been making those explicit links between pipelines on our end to ensure that things run as expected. The modular_pipeline looks pretty cool!

@yetudada
Contributor Author

yetudada commented Mar 8, 2021

Hi @EigenJT! We hope you've been able to make use of the new modular pipeline workflow. We're going to close this issue as part of our GitHub issue clean-up, but please do comment to re-open this issue or create a new one based on your requirements.

@yetudada closed this as completed Mar 8, 2021