-
Notifications
You must be signed in to change notification settings - Fork 893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimise pipeline addition and creation #3730
Conversation
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Awesome job, huge improvement 👏🏻 As for:
That change would be really unfortunate, because the flow of having a hook that changes the Examples: |
This is a side conversation, not related to the PR, but responding to:
The immutability change will be a breaking change unfortunately, so unlikely to happen soon. Nevertheless we can make them The introduction of mutability was already a mistake we should've avoided in a first place. Immutable objects is one of the best ways to ensure that you can pass around a node without copying and make the code safe and bug free. There are different patterns we can apply in order to address your use cases without needing mutability. The current pattern is quite unsafe, e.g. a plugin can attach a completely different function, as there are no validations applied. Moreover, it is the only mutable method there, e.g. if you apply new tags, you get a new copy of the node and you don't modify the current node. It's completely out of place from the current functioning and idea of the nodes and what makes a node node, and not just a function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me.
During some investigation I have done for #3575, I found that Pipeline
is created multiple times so avoiding heavy operation during Pipeline creation time should definitely helps.
i.e. kedro run
first create a Pipeline
, and Pipeline.filter
will create yet another Pipeline
, the factory pipeline
method also create mulitple Pipeline
object in between, those are the things that I would look into next. (or if we make the cost of Pipeline
creation neglectable)
The discussion of attrs
and immutability should goes to a separate issue.
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this improvement @idanov ⭐ Great to see how relatively few changes already improve the performance so much 👍
* Create toposort groups only when needed * Ensure that the suggest resume test has no node ordering requirement * Ensure stable toposorting by grouping and ungrouping the result * Delay toposorting until pipeline.nodes is used * Avoid using .nodes when topological order or new copy is unneeded * Copy the nodes only if tags are provided * Remove unnecessary condition in self.nodes Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com>
) * Update kedro-catalog-0.19.json (#3724) * Update kedro-catalog-0.19.json Signed-off-by: Anthony De Bortoli <anthony.debortoli@protonmail.com> * Update set_up_vscode.md Signed-off-by: Anthony De Bortoli <anthony.debortoli@protonmail.com> --------- Signed-off-by: Anthony De Bortoli <anthony.debortoli@protonmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update project tests directory structure in docs Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add docs on writing tests Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Drop dependency on toposort in favour of built-in graphlib (#3728) * Replace toposort with graphlib (built-in from Python 3.9) Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Create toposort groups only when needed Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Update RELEASE.md and graphlib version constraints Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Remove mypy-toposort Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Ensure that the suggest resume test has no node ordering requirement Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Ensure stable toposorting by grouping and ungrouping the result Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> --------- Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Optimise pipeline addition and creation (#3730) * Create toposort groups only when needed * Ensure that the suggest resume test has no node ordering requirement * Ensure stable toposorting by grouping and ungrouping the result * Delay toposorting until pipeline.nodes is used * Avoid using .nodes when topological order or new copy is unneeded * Copy the nodes only if tags are provided * Remove unnecessary condition in self.nodes Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Expand robots.txt for Kedro-Viz and Kedro-Datasets docs (#3729) * Add project to robots.txt Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com> * Add EOF Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com> --------- Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com> Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Kedro need more uv (#3740) * Kedro need more uv Signed-off-by: Nok <nok.lam.chan@quantumblack.com> * remove docker Signed-off-by: Nok <nok.lam.chan@quantumblack.com> --------- Signed-off-by: Nok <nok.lam.chan@quantumblack.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Resolve all path in Kedro (#3742) * Kedro need more uv Signed-off-by: Nok <nok.lam.chan@quantumblack.com> * remove docker Signed-off-by: Nok <nok.lam.chan@quantumblack.com> * fix broken type hint and resolve project path Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> * fix type hint Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> * remove duplicate logic Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> * adding nok.py is definitely an accident Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> * fix test Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> * remove print Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> * add test Signed-off-by: Nok <nok.lam.chan@quantumblack.com> --------- Signed-off-by: Nok <nok.lam.chan@quantumblack.com> Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Remove settings of rate limits and retries (#3769) * double linkcheck limits Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> * fix ratelimit Signed-off-by: Nok <nok.lam.chan@quantumblack.com> --------- Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> Signed-off-by: Nok <nok.lam.chan@quantumblack.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Improve resume suggestions (#3719) * Improve suggestions to resume a failed pipeline - if dataset (or param) is persistent & shared, don't keep looking for ancestors - only look for ancestors producing impersistent inputs - minimize number of suggested nodes (= shorter message for the same pipeline) - testable logic, tests cases outside of scenarios for sequential runner - Use _EPHEMERAL attribute - Move tests to separate file - Docstring updates --------- Signed-off-by: Ondrej Zacha <ondrej.zacha@okra.ai> Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Build docs fix (#3773) * Ignored forbidden url Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Returned linkscheck retries Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Removed odd comment Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> --------- Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Clarify docs around custom resolvers (#3759) * Updated custom resolver docs section Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Updated advanced configuration section for consistency Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Updated RELEASE.md Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Updated RELEASE.md Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Test linkcheck_workers decrease Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Increased the By default, the linkcheck_rate_limit_timeout to default Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Returned old docs build settings Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Fixed typo Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Ignore forbidden url Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> * Returned linkcheck retries Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> --------- Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add mlruns to gitignore to avoid pushing mlflow local runs to github (#3765) * Add mlruns to gitignore to avoid pushing mlflow local runs to github Signed-off-by: Yolan Honoré-Rougé <yolan.honore.rouge@gmail.com> * update release.md Signed-off-by: Yolan Honoré-Rougé <yolan.honore.rouge@gmail.com> --------- Signed-off-by: Yolan Honoré-Rougé <yolan.honore.rouge@gmail.com> Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update the dependencies page in the docs (#3772) * Update the dependencies page Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com> * Update docs/source/kedro_project_setup/dependencies.md Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com> Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com> * Fix lint Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com> * Move the last line to notes Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com> --------- Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com> Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com> Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Change pipeline test location to project root/tests (#3731) * Change pipeline test location to project root/tests Signed-off-by: lrcouto <laurarccouto@gmail.com> * Fix some test_pipeline tests Signed-off-by: lrcouto <laurarccouto@gmail.com> * Change delete pipeline to account for new structure Signed-off-by: lrcouto <laurarccouto@gmail.com> * Fix some tests Signed-off-by: lrcouto <laurarccouto@gmail.com> * Change tests path on micropkg Signed-off-by: lrcouto <laurarccouto@gmail.com> * Fix remaining tests Signed-off-by: lrcouto <laurarccouto@gmail.com> * Add changes to release notes Signed-off-by: lrcouto <laurarccouto@gmail.com> * Update file structure on micropackaging doc page Signed-off-by: lrcouto <laurarccouto@gmail.com> --------- Signed-off-by: lrcouto <laurarccouto@gmail.com> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add an option for kedro new to skip telemetry (#3701) * First draft for telemetry consent flag on kedro new Signed-off-by: lrcouto <laurarccouto@gmail.com> * Add functioning --telemetry option to kedro new Signed-off-by: lrcouto <laurarccouto@gmail.com> * Update tests to acknowledge new flag Signed-off-by: lrcouto <laurarccouto@gmail.com> * Add tests for kedro new --telemetry flag Signed-off-by: lrcouto <laurarccouto@gmail.com> * Add changes to documentation and release notes Signed-off-by: lrcouto <laurarccouto@gmail.com> * Minor change to docs Signed-off-by: lrcouto <laurarccouto@gmail.com> * Lint Signed-off-by: lrcouto <laurarccouto@gmail.com> * Remove outdated comment and correct type hint Signed-off-by: lrcouto <laurarccouto@gmail.com> * Update docs/source/get_started/new_project.md Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com> * Lint Signed-off-by: lrcouto <laurarccouto@gmail.com> * Minor change on release note Signed-off-by: lrcouto <laurarccouto@gmail.com> --------- Signed-off-by: lrcouto <laurarccouto@gmail.com> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com> Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update documentation for OmegaConfigLoader (#3778) * Update documentation for OmegaConfigLoader Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> * Update RELEASE.md Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> * Update RELEASE.md Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> * Update ignore-names.txt Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> --------- Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Fix path Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Lint Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add changes to RELEASE.md Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Address comments from code review Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Empty Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Remove unneeded imports Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Change recommendation from pytest config to editable install Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add negative testing example Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Replace Dict with dict Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Remove test classes Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Change the assert step for the integration test Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Fix error handling for OmegaConfigLoader (#3784) * Update omegaconf_config.py Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> * Update RELEASE.md Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> * add a more complicated test case Signed-off-by: Nok <nok.lam.chan@quantumblack.com> --------- Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> Signed-off-by: Nok <nok.lam.chan@quantumblack.com> Co-authored-by: Nok <nok.lam.chan@quantumblack.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add Simon Brugman to TSC (#3780) Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update technical_steering_committee.md (#3796) Signed-off-by: Marcin Zabłocki <m.zablo@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Remove jmespath dependency (#3797) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update spaceflights tutorial and starter requirements for kedro-datasets optional dependencies (#3664) * Update spaceflights tutorial and starter requirements Signed-off-by: lrcouto <laurarccouto@gmail.com> * fix e2e tests Signed-off-by: lrcouto <laurarccouto@gmail.com> * Fix e2e tests by distinguishing `kedro-datasets` dependency for different python versions (#3802) Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> * Update docs/source/tutorial/tutorial_template.md Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com> --------- Signed-off-by: lrcouto <laurarccouto@gmail.com> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com> Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Consider Vale's suggestions Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Hide test in details Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Quick fix Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Typo (and wording changes) Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update robots.txt (#3803) Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com> Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add a test for transcoding loops of 1 or more nodes (#3810) Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Ensure no nodes can depend on themselves even when transcoding is used (#3812) * Factor out transcoding helpers into a private module Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Ensure node input/output validation doesn't allow transcoded self-loops Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> * Updated release note to avoid github warning Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> --------- Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> Co-authored-by: Elena Khaustova <ymax70rus@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update UUID telemetry docs (#3805) Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Change path to starters test (#3816) Signed-off-by: lrcouto <laurarccouto@gmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Move changes in RELEASE.md to docs section Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Change formatting Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Revert "Change formatting" This reverts commit 9582a22. Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Apply changes from code review Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Add explanation on why cleanup isn't needed Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Change assert on successful pipeline to check logs Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update description of integration test under pipeline slicing Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Missing formatting Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> * Update tests directory structure Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> --------- Signed-off-by: Anthony De Bortoli <anthony.debortoli@protonmail.com> Signed-off-by: Ahdra Merali <ahdra.merali@quantumblack.com> Signed-off-by: Ivan Danov <idanov@users.noreply.github.com> Signed-off-by: Dmitry Sorokin <dmd40in@gmail.com> Signed-off-by: Nok <nok.lam.chan@quantumblack.com> Signed-off-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> Signed-off-by: Ondrej Zacha <ondrej.zacha@okra.ai> Signed-off-by: Elena Khaustova <ymax70rus@gmail.com> Signed-off-by: Yolan Honoré-Rougé <yolan.honore.rouge@gmail.com> Signed-off-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Signed-off-by: Ankita Katiyar <ankitakatiyar2401@gmail.com> Signed-off-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com> Signed-off-by: lrcouto <laurarccouto@gmail.com> Signed-off-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com> Signed-off-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> Signed-off-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Signed-off-by: Marcin Zabłocki <m.zablo@gmail.com> Signed-off-by: Merel Theisen <merel.theisen@quantumblack.com> Signed-off-by: Ahdra Merali <90615669+AhdraMeraliQB@users.noreply.github.com> Co-authored-by: Anthony De Bortoli <anthony.debortoli@protonmail.com> Co-authored-by: Ivan Danov <idanov@users.noreply.github.com> Co-authored-by: Dmitry Sorokin <40151847+DimedS@users.noreply.github.com> Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com> Co-authored-by: Nok Lam Chan <nok.lam.chan@quantumblack.com> Co-authored-by: Ondrej Zacha <ondrej.zacha@okra.ai> Co-authored-by: ElenaKhaustova <157851531+ElenaKhaustova@users.noreply.github.com> Co-authored-by: Yolan Honoré-Rougé <29451317+Galileo-Galilei@users.noreply.github.com> Co-authored-by: Merel Theisen <49397448+merelcht@users.noreply.github.com> Co-authored-by: Ankita Katiyar <110245118+ankatiyar@users.noreply.github.com> Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com> Co-authored-by: L. R. Couto <57910428+lrcouto@users.noreply.github.com> Co-authored-by: Puneet Saini <99470400+puneeter@users.noreply.github.com> Co-authored-by: Marcin Zabłocki <m.zablo@gmail.com> Co-authored-by: Elena Khaustova <ymax70rus@gmail.com>
Description
Creating large pipelines in Kedro is very slow and can take tens of seconds as reported in #3167
After some investigation, it turned out that number of factors contributed to that:
self.nodes
within thePipeline
class triggers many unneeded copies (original PRs: Don't toposort nodes in non-user-facing operations #3146)Pipeline
objectPipeline
object creation, caused by each node being copied and revalidatedThis PR addresses the first two completely, while it addresses the third one partially only when no new tags are added.
The first one was addressed by #3146, but it's still not merged yet.
Development notes
For testing, @marrrcin 's test from #3167 was used and https://pyinstrument.readthedocs.io/en/latest/ profiling was run before and after.
After the changes from this PR and #3728, we've reduced the time it takes to sum 51 pipelines from ~15s down to ~6s, which is about 60% reduction in time. All of that was tested on Python 3.8 with the
graphlib
backport, it's possible that the built-ingraphlib
is much faster than the backport and might yield better results.Further improvements could be done by removing unnecessary
set()
andlist()
operations, doing a lightweight check for cycles without the need of instantiatinggraphlib.TopologicalSorter
upon init and potentially makingNode
andPipeline
use attrs. The latter will help ensuring that they remain immutable, as apparently a previous contribution snuck-in mutability to theNode
class which is against the idea of stateless nodes:kedro/kedro/pipeline/node.py
Lines 231 to 239 in 0fc8089
Before:
After:
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a
Signed-off-by
line in the commit message. See our wiki for guidance.If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist
RELEASE.md
file