Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't toposort nodes in non-user-facing operations #3146

Closed
wants to merge 14 commits into from

Conversation

deepyaman
Copy link
Member

@deepyaman deepyaman commented Oct 9, 2023

NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.

Description

https://kedro-org.slack.com/archives/C054U0LJLAC/p1696862244884179

Development notes

Docs failure fixed by #3152

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Copy link
Member

@astrojuanlu astrojuanlu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make def nodes a cached_property instead? Are pipelines supposed to be mutable?

@deepyaman
Copy link
Member Author

deepyaman commented Oct 9, 2023

Why not make def nodes a cached_property instead?

I think we can (and quite possibly should), but I also think there's not even a need to sort once on a lot of these operations. Almost none of these operations are user-facing.

Edit: I've changed my mind; accessing nodes isn't that expensive, when a person does it. The issue is happening when we're unnecessarily accessing the toposorted version over and over again from our code.

We could still make it a cached property (depending on the answer to your other question), but I think whether it's a cached property or not doesn't actually make a significant difference for user behaviors, and just using the unsorted version is more efficient regardless for the motivating case (and a lot of the update operations in general).

Are pipelines supposed to be mutable?

I don't know; I suppose they could be? But I don't think anybody would suffer much if they weren't.

Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
Signed-off-by: Deepyaman Datta <deepyaman.datta@utexas.edu>
@deepyaman deepyaman marked this pull request as ready for review October 10, 2023 14:45
@deepyaman deepyaman self-assigned this Oct 10, 2023
@deepyaman deepyaman enabled auto-merge (squash) October 14, 2023 19:17
Copy link
Contributor

@marrrcin marrrcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tested it some time ago as part of #3167 - it contributed to a small (~5%) improvement.

@noklam
Copy link
Contributor

noklam commented Nov 17, 2023

Add more context, since the original slack thread is in a private channel I cannot get a linen copy.

I’ve stumped upon the performance issue with huge pipeline sums - it appears in the context of (obviously) dynamic pipelines / pipelines that are generated in loops.
It seems like the sum of pipelines is O(n^2) and it starts getting painful for our client 😞
In the chart n is the number of pipelines to sum, time is … time in seconds.

Copy link
Contributor

@noklam noklam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to understand this PR more. Is the description of this PR updated? AFAIK, nodes does not do any additional toposorts so I am not sure what's the fix for.

Does this actually helps performance or is just a clean up?

@noklam
Copy link
Contributor

noklam commented Nov 17, 2023

An archive of the discussion (terribly formatted)

marrrcin
1 month ago
I’ve stumped upon the performance issue with huge pipeline sums - it appears in the context of (obviously) dynamic pipelines / pipelines that are generated in loops.
It seems like the sum of pipelines is O(n^2) and it starts getting painful for our client 😞
In the chart n is the number of pipelines to sum, time is … time in seconds.
plot.png

plot.png

49 replies

marrrcin
1 month ago
Repro code (modified spaceflights starter)
def create_pipeline(**kwargs) -> Pipeline:
data_engineering_pipeline = pipeline(
[
node(
func=preprocess_companies,
inputs="companies",
outputs="preprocessed_companies",
name="preprocess_companies_node",
),
node(
func=preprocess_shuttles,
inputs="shuttles",
outputs="preprocessed_shuttles",
name="preprocess_shuttles_node",
),
node(
func=create_model_input_table,
inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
outputs="model_input_table",
name="create_model_input_table_node",
),
]
)

# TEST
import time
pipelines = []
MAX = 500
for i in range(MAX + 1):
    pipelines.append(
        pipeline(
            data_engineering_pipeline,
            inputs={"companies": "companies",
                    "shuttles": "shuttles",
                    "reviews": "reviews"},
            namespace=f"namespace_{i}",
        )
    )
data = []
for n in range(1, MAX, 10):
    start = time.monotonic()
    _ = sum(pipelines[:n])
    end = time.monotonic()
    print(f"Sum of {n} pipelines took: {end - start:0.3f}s")
    data.append((n, end - start))

import pandas as pd
df = pd.DataFrame(data, columns=["n", "time"])
df.plot.scatter(x="n", y="time").get_figure().savefig("plot.png")

marrrcin
1 month ago
Sum of 1 pipelines took: 0.000s
Sum of 11 pipelines took: 0.010s
Sum of 21 pipelines took: 0.036s
Sum of 31 pipelines took: 0.080s
Sum of 41 pipelines took: 0.145s
Sum of 51 pipelines took: 0.217s
Sum of 61 pipelines took: 0.300s
Sum of 71 pipelines took: 0.477s
Sum of 81 pipelines took: 0.549s
Sum of 91 pipelines took: 0.727s
Sum of 101 pipelines took: 0.868s
Sum of 111 pipelines took: 1.137s
Sum of 121 pipelines took: 1.311s
Sum of 131 pipelines took: 1.568s
Sum of 141 pipelines took: 1.824s
Sum of 151 pipelines took: 2.161s
Sum of 161 pipelines took: 2.836s
Sum of 171 pipelines took: 2.722s
Sum of 181 pipelines took: 3.906s
Sum of 191 pipelines took: 4.064s
Sum of 201 pipelines took: 4.644s
Sum of 211 pipelines took: 5.507s
Sum of 221 pipelines took: 6.610s
Sum of 231 pipelines took: 5.762s
Sum of 241 pipelines took: 6.654s
Sum of 251 pipelines took: 7.717s
Sum of 261 pipelines took: 8.446s
Sum of 271 pipelines took: 9.495s
Sum of 281 pipelines took: 10.725s
Sum of 291 pipelines took: 13.334s
Sum of 301 pipelines took: 12.023s
Sum of 311 pipelines took: 12.660s
Sum of 321 pipelines took: 12.584s
Sum of 331 pipelines took: 12.685s
Sum of 341 pipelines took: 13.855s
Sum of 351 pipelines took: 14.573s
Sum of 361 pipelines took: 15.114s
Sum of 371 pipelines took: 15.624s
Sum of 381 pipelines took: 16.656s
Sum of 391 pipelines took: 26.046s
Sum of 401 pipelines took: 25.595s
Sum of 411 pipelines took: 25.409s
Sum of 421 pipelines took: 23.397s
Sum of 431 pipelines took: 22.718s
Sum of 441 pipelines took: 23.516s
Sum of 451 pipelines took: 31.217s
Sum of 461 pipelines took: 47.941s
Sum of 471 pipelines took: 47.578s
Sum of 481 pipelines took: 37.139s
Sum of 491 pipelines took: 35.174s

Ivan Danov
1 month ago
The code for adding pipelines is this:
def add(self, other):
if not isinstance(other, Pipeline):
return NotImplemented
return Pipeline(set(self.nodes + other.nodes))

marrrcin
1 month ago
I think it’s toposort called N times that causes this

Ivan Danov
1 month ago
It's not the most efficient, but it shouldn't be quadratic... Unless something happens in Pipeline creation, e.g. the toposorting here might be challenging:
self._topo_sorted_nodes = _topologically_sorted(self.node_dependencies)

juanlu
1 month ago
ugh, good insight
@marrrcin

  • would you please open an issue about this? unlikely to affect small projects of course, but annoying enough for big projects we should be looking at it
    👍
    1

Ivan Danov
1 month ago
Well, summing more than 10 pipelines together is not a realistic scenario tbh...

datajoely
1 month ago
Well, summing more than 10 pipelines together is not a realistic scenario tbh...
Users will and do hit this
:this:
2

datajoely
1 month ago
I wonder if we can make toposort a bit more lazy

Ivan Danov
1 month ago
It's not the toposort thing... It's the way Python works

Ivan Danov
1 month ago
This one is summing two lists, try the same thing with integers and you will get the same result

marrrcin
1 month ago
We have a client who has ~20 instances of a single pipeline and ~40 pipelines in total (they are 100+ nodes) and kedro run --pipeline=only_single_pipeline takes 10+ minutes for them (to start executing) (edited)
😮
1

Ivan Danov
1 month ago
(we only add extra copying around)

Ivan Danov
1 month ago
Are you sure it's the pipelines, and not the config for those pipelines?

marrrcin
1 month ago
What do you mean by “config for those pipelines”?

datajoely
1 month ago
the reproducible example was above

Ivan Danov
1 month ago
@datajoely
the reproducible example shows that when you sum 100 pipelines, it takes less than 1s...

datajoely
1 month ago
good point!

marrrcin
1 month ago
I guess it also grows with nodes - I will do this test tomorrow - in client’s code base I’ve narrowed it down to the Pipeline constructor that causes the issue

Ivan Danov
1 month ago
@marrrcin
most of the time slowdowns happen due to large catalogs, not large pipelines
👍
1

Ivan Danov
1 month ago
have you checked if the same thing happens with an empty catalog?

marrrcin
1 month ago
Will do tomorrow!

juanlu
1 month ago
we don't need to speculate hopefully,
@marrrcin
if you could produce a flamegraph like #3033 (comment), we could have a much clear view of what's going on and I believe there would be no compromise of confidential data
image.png

image.png
👍
1

Ivan Danov
1 month ago
I think it's definitely worth making a Kedro optimisation sprint to make it faster wherever possible... A big problem has always been the config loading and catalog creation, especially with custom configs, because they tend to use the old .get API quite often, which is always reloading the config.
👍
1

marrrcin
1 month ago
It grows even faster when number of nodes is large (200 nodes):
Sum of 1 pipelines took: 0.000s
Sum of 11 pipelines took: 0.808s
Sum of 21 pipelines took: 2.921s
Sum of 31 pipelines took: 7.425s
Sum of 41 pipelines took: 13.902s

marrrcin
1 month ago
Funny thing is that we’ve been recommending them to keep the nodes small and now it got back on us 😄

Deepyaman Datta
1 month ago
I think your intuition is right re toposort; just looking at the code, using the nodes accessor doesn't make sense here.

Deepyaman Datta
1 month ago
Let me see if something like #3146 will pass
#3146 Don't toposort nodes in non-user-facing operations
NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.
Description
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commi… Show more
https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 9th | Added by GitHub
❤️
1

Deepyaman Datta
1 month ago
I feel like there's a lot of unnecessary sorting going on 🙂

datajoely
1 month ago
there is also the parallel point about deterministic toposort that people keep asking for

Ivan Danov
1 month ago
It might quite well be toposort, I think I vaguely remember that the library we are using is not very scaleable....

juanlu
1 month ago
leaving a comment directly on the PR 👍:skin-tone-3:

juanlu
1 month ago
also thinking a lot about a certain :knuth: quote
👀
1

datajoely
1 month ago
https://docs.python.org/3/library/graphlib.html
Graphlib was added to python 3.9
Python documentationPython documentation
graphlib — Functionality to operate with graph-like structures
Source code: Lib/graphlib.py Exceptions: The graphlib module defines the following exception classes:
❤️
1

datajoely
1 month ago
we can now use it

juanlu
1 month ago
omg I had no idea about stdlib graphlib (edited)

juanlu
1 month ago
when are we dropping Python 3.8? 😄

datajoely
1 month ago
oh man got 3.7 mixed up

datajoely
1 month ago
we could do the whole conditional stuff and use graphlib if on the right version

Ivan Danov
1 month ago
https://pypi.org/project/graphlib-backport/
PyPIPyPI
graphlib-backport
Backport of the Python 3.9 graphlib module for Python 3.6+ (7 kB)
https://pypi.org/project/graphlib-backport/

🥳
1

datajoely
1 month ago
cannot wait to do this
https://twitter.com/cto_junior/status/1711067515600646585
X (formerly Twitter)X (formerly Twitter)
TDM (e/flλ) on X
A story in 4 parts.
Can't wait to offload more L5 tasks to GPT-4 (114 kB)
https://twitter.com/cto_junior/status/1711067515600646585

Deepyaman Datta
1 month ago
@marrrcin
went ahead and made sure tests and stuff are passing; can you check if #3146 resolves your issue, when you get a chance? Since you've already got your little test script, I assume. 🙂
#3146 Don't toposort nodes in non-user-facing operations
NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.
Description
Development notes
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commi… Show more
Comments
1
https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 9th | Added by GitHub

marrrcin
1 month ago
@Deepyaman Datta

There is a small improvement, but it’s less than 5%
😢
1

Deepyaman Datta
1 month ago
I'm happy to look into this more thoroughly at some point (sounds like something I'd like to do), but I'm not sure just how soon. 😅

marrrcin
1 month ago
I’ve diagnosed it down with pyinstrument
@juanlu

  • the most time is spent on two functions:
    init of Pipeline class
    add of Pipeline class <--- this one would most likely be fixed partially by your PR
    @Deepyaman Datta
    , I will run one more test with your code to confirm that
    🙌
    1

marrrcin
1 month ago
OK,
@Deepyaman Datta

  • with your fix, the add now takes less time - for the same tests it went down from 10.1s self-time to 2.1s self-time; init is faster too, but not as much. It’s definitely something 🙌
    🚀
    3

marrrcin
1 month ago
Added #3167 for that
#3167 Low performance of pipeline sums
Description
I have a project where there is a huge number of pipelines generated programatically (in a loop). The process of generating those pipelines takes a lot of time and it seems to be quadratic (see the chart below).
plot
n - number of pipelines to sum
time - time in seconds
The problem has 2 variants:

  1. Large number of small pipelines
  2. Small number of pipelines with large node count (200+).
    Context
    While Kedro encourages to keep the nodes small and pipelines modular - extensive use of both of those features/approaches lead to slow project startup times.
    Steps to Reproduce
  3. Create a project from spaceflights starter.
  4. Change data_processing pipeline to:
    Show the code ⬇️
    def create_pipeline(…
    Show less
    https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 12th | Added by GitHub
    Low performance of pipeline sums #3167

👍:skin-tone-3:👍
2
❤️
1

Deepyaman Datta
3 days ago
Doing some cleanup of my PRs--can I get reviews on #3146, please? 🙂 Since it is an improvement regardless.
#3146 Don't toposort nodes in non-user-facing operations
NOTE: Kedro datasets are moving from kedro.extras.datasets to a separate kedro-datasets package in
kedro-plugins repository. Any changes to the dataset implementations
should be done by opening a pull request in that repository.
Description
https://kedro-org.slack.com/archives/C054U0LJLAC/p1696862244884179
Development notes
Docs failure fixed by #3152
Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.
If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the G… Show more
Comments
1
Assignees
@deepyaman
https://github.com/[kedro-org/kedro](https://github.com/kedro-org/kedro)|kedro-org/kedrokedro-org/kedro | Oct 9th | Added by GitHub

Click to Expand

@noklam
Copy link
Contributor

noklam commented Mar 8, 2024

@deepyaman Are you still interested to continue with this PR?

Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Signed-off-by: Ivan Danov <idanov@users.noreply.github.com>
Copy link
Member

@idanov idanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still an improvement over the original, since every time we call self.nodes, we copy the whole list. We can save up on making new copies of that list for internal non-mutating tasks. Not sure whether the stated goal of the PR is met though, since the toposorting happens only once anyways. Should we refocus this PR and make it about avoiding unnecessary copies through using list, set and copy in different places internally?

@@ -2,6 +2,7 @@

## Major features and improvements
* Improved error message when passing wrong value to node.
* Improved performance of pipeline operations by using non-toposorted nodes in internal code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Improved performance of pipeline operations by using non-toposorted nodes in internal code.
* Improved performance of pipeline operations by avoiding unnecessary copies of the nodes.

@merelcht
Copy link
Member

I'm guessing this PR can be closed now #3730 is merged?

@deepyaman
Copy link
Member Author

I'm guessing this PR can be closed now #3730 is merged?

Seems like that is a superset of my changes, so sure.

@deepyaman deepyaman closed this Apr 3, 2024
auto-merge was automatically disabled April 3, 2024 13:25

Pull request was closed

@deepyaman deepyaman deleted the deepyaman-patch-1 branch April 3, 2024 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants