group by high cardinality column in datafusion 10 times slower than low cardinality column #1246

Closed
@jiangzhx

Describe the bug
Grouping by a high-cardinality column in DataFusion is about 10 times slower than grouping by a low-cardinality column.
I also tested other OLAP engines, where the high-cardinality query is only about 2 times slower or less:

  • Trino (OLAP engine written in Java)

    low cardinality: ~1400 ms
    high cardinality: ~2700 ms

  • Doris (OLAP engine written in C++)

    low cardinality: ~350 ms
    high cardinality: ~500 ms


To Reproduce
Steps to reproduce the behavior:
Use a Parquet table with 60,000,000 rows, generated by ssb-dbgen, and run the two queries below.

group by LO_ORDERPRIORITY

SELECT sum(LO_EXTENDEDPRICE) AS revenue  FROM lineorder_flat group by LO_ORDERPRIORITY;
5 rows in set. Query took 0.341 seconds.

group by S_ADDRESS

SELECT sum(LO_EXTENDEDPRICE) AS revenue  FROM lineorder_flat group by S_ADDRESS;
20000 rows in set. Query took 2.582 seconds.
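
To put the numbers in this report side by side, here is a minimal sketch that computes the high-cardinality slowdown for each engine. The timings are copied from this report (the Trino and Doris figures are approximate); the `timings_ms` dictionary and the `slowdown` helper are names made up for illustration and are not part of any engine's API.

```python
# Timings copied from this report, in milliseconds:
# (low-cardinality query, high-cardinality query).
# The Trino and Doris values are approximate.
timings_ms = {
    "datafusion": (341, 2582),
    "trino": (1400, 2700),
    "doris": (350, 500),
}


def slowdown(low_ms, high_ms):
    """Return how many times slower the high-cardinality query is."""
    return high_ms / low_ms


ratios = {engine: slowdown(low, high) for engine, (low, high) in timings_ms.items()}

for engine, ratio in ratios.items():
    print(f"{engine}: {ratio:.2f}x")
```

For the single run shown above this prints 7.57x for DataFusion, 1.93x for Trino, and 1.43x for Doris.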

Expected behavior
The slowdown for the high-cardinality query should be comparable to that of the other engines.

Labels: bug
