Low performance of pipeline sums #3167

marrrcin · 2023-10-12T11:22:42Z

Description

I have a project where a huge number of pipelines is generated programmatically (in a loop). Generating those pipelines takes a lot of time, and the time appears to grow quadratically with the number of pipelines being summed (see the chart below).

Chart: time (in seconds) plotted against n (the number of pipelines to sum).

The problem has two variants:

A large number of small pipelines.
A small number of pipelines with a large node count (200+).

Context

While Kedro encourages keeping nodes small and pipelines modular, extensive use of both approaches leads to slow project startup times.

The most severe impact of this issue is in mono-repo setups, where multiple teams work in the same project but on separate pipelines - in such setups the number of pipelines grows quickly as the development proceeds.

Steps to Reproduce

Create a project from spaceflights starter.
Change data_processing pipeline to:


def create_pipeline(**kwargs) -> Pipeline:
    data_engineering_pipeline = pipeline(
        [
            node(
                func=preprocess_companies,
                inputs="companies",
                outputs="preprocessed_companies",
                name="preprocess_companies_node",
            ),
            node(
                func=preprocess_shuttles,
                inputs="shuttles",
                outputs="preprocessed_shuttles",
                name="preprocess_shuttles_node",
            ),
            node(
                func=create_model_input_table,
                inputs=["preprocessed_shuttles", "preprocessed_companies", "reviews"],
                outputs="model_input_table",
                name="create_model_input_table_node",
            ),
        ] + [node(
            func=lambda x: print("YOLO", x),
            inputs="parameters",
            outputs=f"yolo_{i}",
            name=f"yolo_{i}"
        ) for i in range(200)]
    )

    # Poor man's performance test
    import time
    pipelines = []
    MAX = 60
    for i in range(MAX + 1):
        pipelines.append(
            pipeline(
                data_engineering_pipeline,
                inputs={"companies": "companies",
                        "shuttles": "shuttles",
                        "reviews": "reviews"},
                namespace=f"namespace_{i}",
            )
        )
    data = []
    for n in range(1, MAX, 10):
        start = time.monotonic()
        _ = sum(pipelines[:n])
        end = time.monotonic()
        print(f"Sum of {n} pipelines took: {end - start:0.3f}s")
        data.append((n, end - start))


    # uncomment to output chart / data
    # import pandas as pd
    # df = pd.DataFrame(data, columns=["n", "time"])
    # df.plot.scatter(x="n", y="time").get_figure().savefig("plot.png")
    return sum(pipelines)

Run kedro registry list

Expected Result

Pipelines are listed quickly.

Actual Result

The pipelines are listed after a few minutes (depending on the number of pipelines/nodes), with the time increasing quadratically (see the chart above).
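One way to see why the growth would be quadratic: the built-in sum() adds the items one by one, so every step creates a new intermediate object that holds all nodes accumulated so far. The sketch below is a toy cost model, not Kedro code. ToyPipeline, init_work, and work_for_sum are hypothetical names, and the assumption is that the constructor does work proportional to the number of nodes it receives. The 203 nodes per pipeline match the reproduction above (3 original nodes plus 200 generated ones).

```python
# Toy cost model, NOT Kedro code: ToyPipeline is a hypothetical stand-in
# whose constructor is assumed to do work proportional to its node count.


class ToyPipeline:
    init_work = 0  # total number of nodes processed by all constructors

    def __init__(self, nodes):
        self.nodes = list(nodes)
        # Stand-in for per-node work done at construction time.
        ToyPipeline.init_work += len(self.nodes)

    def __add__(self, other):
        # Every addition builds a brand-new object from all nodes so far.
        return ToyPipeline(self.nodes + other.nodes)

    def __radd__(self, other):
        # Lets the built-in sum() start from its default value of 0.
        return self if other == 0 else NotImplemented


def work_for_sum(n_pipelines, nodes_per_pipeline):
    """Count the constructor work caused by sum() over n equal-sized parts."""
    parts = [ToyPipeline(range(nodes_per_pipeline)) for _ in range(n_pipelines)]
    ToyPipeline.init_work = 0  # ignore the cost of building the parts
    sum(parts)
    return ToyPipeline.init_work


for n in (10, 20, 40):
    print(n, work_for_sum(n, nodes_per_pipeline=203))
```

In this model, summing n pipelines of k nodes each processes k * (2 + 3 + ... + n) nodes in total, so doubling n roughly quadruples the work even if the per-node cost is constant.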

Possible causes

The main problem is that, internally, summing pipelines goes through __add__ and then __init__ in the Pipeline class. The slowness of the operations inside __add__ itself is partially addressed by #3146, but the problem with __init__ remains; the calls to _topologically_sorted in the constructor may be the root cause. Confirming this would require more detailed profiling.
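A minimal sketch of how that profiling could be done with the standard library's cProfile and pstats modules. The helper profile_call and the stand-in workload are hypothetical names introduced for illustration; in the reproduction above, the callable would be something like lambda: sum(pipelines).

```python
import cProfile
import io
import pstats


def profile_call(func, top=10):
    """Run func() under cProfile; return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


# In the reproduction above this could be called as, for example:
#     print(profile_call(lambda: sum(pipelines)))
# A stand-in workload keeps this snippet runnable without a Kedro project.
def stand_in_workload():
    return sorted(range(100_000), key=lambda x: -x)


report = profile_call(stand_in_workload)
print(report)
```

Sorting by cumulative time should show whether most of the time is spent under __init__ (and whatever it calls) or under __add__ itself.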

Your Environment

Kedro version used: 0.18.13
Python version used: 3.10.13
Operating system and version: macOS 13.0.1


astrojuanlu · 2024-05-13T08:52:01Z

Was this fully addressed by #3730?

marrrcin · 2024-05-13T11:22:47Z

I hope so 🤞🏻

github-actions bot mentioned this issue Nov 1, 2023: Monthly issue metrics report #3256 (Closed)

marrrcin mentioned this issue Nov 14, 2023: Don't toposort nodes in non-user-facing operations #3146 (Closed)

idanov mentioned this issue Mar 21, 2024: Optimise pipeline addition and creation #3730 (Merged)

marrrcin closed this as completed May 13, 2024

astrojuanlu mentioned this issue Oct 30, 2024: Investigate performance of config loading for big projects #3893 (Closed)
