Description
Describe the bug
While enabling StringView reading from Parquet in #13101, @Dandandan noticed a slight regression for TPCH query 18 (see the comment on #13101).
Here is the query:
select
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice,
sum(l_quantity)
from
customer,
orders,
lineitem
where
o_orderkey in (
select
l_orderkey
from
lineitem
group by
l_orderkey having
sum(l_quantity) > 300
)
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by
c_name,
c_custkey,
o_orderkey,
o_orderdate,
o_totalprice
order by
o_totalprice desc,
o_orderdate;
To Reproduce
Make data
# make the data and get to the correct location
cd datafusion/benchmarks
./bench.sh data tpch
cd data/tpch_sf1
Run query:
datafusion-cli -f ../../queries/q18.sql | grep Elapsed
Elapsed 0.088 seconds.
With StringView enabled, the query appears to be slightly slower.
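Because a single run takes well under a second, a slight regression is easy to lose in run-to-run noise. A minimal sketch for averaging several runs, assuming the `Elapsed N seconds.` output format shown above; the helper name `avg_elapsed` and the run count are hypothetical choices, not part of datafusion-cli:

```shell
# Hypothetical helper: read datafusion-cli output on stdin and print the
# mean of all "Elapsed N seconds." lines, formatted to three decimals.
avg_elapsed() {
  awk '/Elapsed/ { sum += $2; n++ } END { if (n) printf "%.3f\n", sum / n }'
}

# Example usage (assumes datafusion-cli is on PATH and the current
# directory is data/tpch_sf1, as in the steps above):
#   for i in 1 2 3 4 5; do
#     datafusion-cli -f ../../queries/q18.sql
#   done | avg_elapsed
```

Running the same loop against builds with and without the StringView change gives two means that are easier to compare than single timings.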
Expected behavior
StringView should always be faster
Additional context
I took a brief look at the flamegraphs. One difference appears to be in BatchCoalescer::push_batch.
There is a special case for StringView here:
https://github.com/apache/datafusion/blob/6034be42808b43e3f48f6e58ec38cc35fa253abb/datafusion/physical-plan/src/coalesce/mod.rs#L117-L116
Here are the explain plans for the query before and after the change
Here are the flamegraphs for the query before/after the change