Improve join performance for h2o queries #13765

Open
@MrPowers

Description

Is your feature request related to a problem or challenge?

DataFusion joins are generally performant, but they can start erroring out when memory becomes limited.

Here are the h2o join queries run on my local machine (MacBook M3 with 16 GB of RAM):

[Image: h2o-join]

DataFusion performs really well except for query 5, which joins two 100 million row tables. DataFusion errors out for query 5 on my machine.

DataFusion is the fastest option when joining a 100 million row table with a 100 row or 100,000 row table.

These same queries perform better in the official benchmarks, which are run on a much more powerful machine:

[Image: Screenshot 2024-12-13 at 1.36.28 PM]

The official benchmarks show an error for DataFusion on the 1 billion row table:

[Image: Screenshot 2024-12-13 at 1.37.36 PM]

I am not sure about the underlying issue, but it seems like there are problems when memory becomes limited.
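
As a rough illustration of why memory could be the limiting factor, here is a back-of-envelope sketch of how much memory one side of a hash join might need if it has to be held fully in memory. It is not taken from the benchmark or from DataFusion internals: the 64 bytes per row, the 1.5x hash-table overhead, and the helper name `build_side_bytes` are all assumptions made up for illustration.

```python
# Back-of-envelope estimate of the memory needed to hold one join input in memory.
# Every size here is an illustrative assumption, not a measurement from h2o
# or from DataFusion.

GIB = 1024 ** 3


def build_side_bytes(rows: int, bytes_per_row: int, hash_overhead: float = 1.5) -> int:
    """Estimate bytes needed to keep one join input fully in memory.

    `hash_overhead` is a hypothetical multiplier for hash-table bookkeeping.
    """
    return int(rows * bytes_per_row * hash_overhead)


if __name__ == "__main__":
    ram_bytes = 16 * GIB  # the local machine described above has 16 GB of RAM

    # Row counts of the smaller join inputs mentioned in this issue.
    for rows in (100, 100_000, 100_000_000):
        need = build_side_bytes(rows, bytes_per_row=64)  # 64 bytes/row is assumed
        print(f"{rows:>11,} rows -> {need / GIB:8.3f} GiB "
              f"({need / ram_bytes:.1%} of RAM)")
```

Under these assumed numbers, the 100 row and 100,000 row inputs are negligible, while a 100 million row input would take roughly 9 GiB before counting the other table, output batches, or anything else running on the machine.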

Describe the solution you'd like

Hopefully DataFusion can perform similarly to other engines for large table to large table joins.

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement (New feature or request)
