Closed
Description
Describe the bug
Grouping by a high-cardinality column in DataFusion is about 10 times slower than grouping by a low-cardinality column.
I also tested other OLAP engines; there the high-cardinality query is only about 2 times slower, or less:
- Trino (OLAP engine written in Java):
  low cardinality: about 1400 ms; high cardinality: about 2700 ms
- Doris (OLAP engine written in C++):
  low cardinality: about 350 ms; high cardinality: about 500 ms
To Reproduce
Steps to reproduce the behavior:
Parquet table with 60,000,000 rows; data generated by ssb-dbgen.
Group by LO_ORDERPRIORITY (low cardinality, 5 groups):
SELECT sum(LO_EXTENDEDPRICE) AS revenue FROM lineorder_flat group by LO_ORDERPRIORITY;
5 rows in set. Query took 0.341 seconds.
Group by S_ADDRESS (high cardinality, 20,000 groups):
SELECT sum(LO_EXTENDEDPRICE) AS revenue FROM lineorder_flat group by S_ADDRESS;
20000 rows in set. Query took 2.582 seconds.
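For a quick side-by-side comparison, the slowdown factors implied by the timings quoted in this report can be computed as in the sketch below. The dictionary and function names are made up for illustration, and the Trino and Doris figures are the approximate values given above, so the ratios are rough.

```python
# Timings copied from this report, in milliseconds:
# (low-cardinality GROUP BY, high-cardinality GROUP BY).
# The Trino and Doris values are approximate ("±" in the report).
timings_ms = {
    "datafusion": (341, 2582),
    "trino": (1400, 2700),
    "doris": (350, 500),
}


def slowdown(low_ms, high_ms):
    """Return how many times slower the high-cardinality query is."""
    return high_ms / low_ms


ratios = {
    engine: slowdown(low, high) for engine, (low, high) in timings_ms.items()
}

for engine, ratio in ratios.items():
    print(f"{engine}: {ratio:.2f}x")
# prints:
# datafusion: 7.57x
# trino: 1.93x
# doris: 1.43x
```

On these numbers the DataFusion slowdown is roughly 4 to 5 times larger than that of the other two engines.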
Expected behavior
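The slowdown should be similar to that of the other engines, i.e. the high-cardinality query should be at most about 2 times slower than the low-cardinality one.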