Potential performance regression for TPCH q18 #13188

Open
@alamb

Describe the bug

While enabling StringView reading from Parquet in #13101, @Dandandan noticed a slight regression for TPCH q18: #13101 (comment)

Here is the query:

select
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice,
    sum(l_quantity)
from
    customer,
    orders,
    lineitem
where
        o_orderkey in (
        select
            l_orderkey
        from
            lineitem
        group by
            l_orderkey having
                sum(l_quantity) > 300
    )
  and c_custkey = o_custkey
  and o_orderkey = l_orderkey
group by
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice
order by
    o_totalprice desc,
    o_orderdate;

To Reproduce

Make the data:

# make the data and get to the correct location
cd datafusion/benchmarks
./bench.sh data tpch
cd data/tpch_sf1

Run query:

datafusion-cli -f ../../queries/q18.sql  | grep Elapsed
Elapsed 0.088 seconds.

With StringView enabled, the query seems to be slightly slower.

Expected behavior

StringView should always be faster.

Additional context

I took a brief look at the flamegraphs. It seems like one difference could be in BatchCoalescer::push_batch.

[Screenshot, 2024-10-30 2:13:38 PM]

There is a special case for StringView here:
https://github.com/apache/datafusion/blob/6034be42808b43e3f48f6e58ec38cc35fa253abb/datafusion/physical-plan/src/coalesce/mod.rs#L117-L116
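The linked code is not reproduced here. As a rough illustration of why a StringView-specific path in a coalescer can add cost, here is a minimal, std-only sketch of one thing such a path might do: check whether a view array still references large, mostly unused buffers (for example after a filter) and, if so, copy the live strings into a fresh buffer. Every name below (`View`, `ViewArray`, `is_sparse`, `compact`) is hypothetical, and the 2x threshold is an assumption chosen for illustration; none of this is the arrow-rs or DataFusion API.

```rust
use std::rc::Rc;

// Simplified model of a string-view array. All names are hypothetical and
// do not come from arrow-rs or DataFusion.

/// Strings of at most this many bytes are stored inline in the view.
const INLINE_LEN: usize = 12;

#[derive(Debug, Clone)]
enum View {
    /// Short string stored directly in the view.
    Inline(String),
    /// Long string stored as a reference into a shared buffer.
    Ref { buffer: usize, offset: usize, len: usize },
}

#[derive(Debug)]
struct ViewArray {
    views: Vec<View>,
    buffers: Vec<Rc<Vec<u8>>>,
}

impl ViewArray {
    /// Build an array, appending all long strings to a single buffer.
    fn from_strs(values: &[&str]) -> Self {
        let mut buf = Vec::new();
        let mut views = Vec::with_capacity(values.len());
        for v in values {
            if v.len() <= INLINE_LEN {
                views.push(View::Inline(v.to_string()));
            } else {
                views.push(View::Ref { buffer: 0, offset: buf.len(), len: v.len() });
                buf.extend_from_slice(v.as_bytes());
            }
        }
        ViewArray { views, buffers: vec![Rc::new(buf)] }
    }

    /// Keep only the selected rows. Views are copied, but buffers are shared
    /// unchanged, so they may now hold many bytes that no view refers to.
    fn filter(&self, keep: &[bool]) -> Self {
        let views = self
            .views
            .iter()
            .zip(keep)
            .filter(|(_, k)| **k)
            .map(|(v, _)| v.clone())
            .collect();
        ViewArray { views, buffers: self.buffers.clone() }
    }

    /// Read the string at row `i`.
    fn value(&self, i: usize) -> &str {
        match &self.views[i] {
            View::Inline(s) => s.as_str(),
            View::Ref { buffer, offset, len } => {
                let buf: &[u8] = self.buffers[*buffer].as_slice();
                std::str::from_utf8(&buf[*offset..*offset + *len]).unwrap()
            }
        }
    }

    /// Bytes a fully compacted array would need for its long strings.
    fn ideal_buffer_size(&self) -> usize {
        self.views
            .iter()
            .map(|v| match v {
                View::Ref { len, .. } => *len,
                View::Inline(_) => 0,
            })
            .sum()
    }

    /// Bytes currently held by all buffers, used or not.
    fn actual_buffer_size(&self) -> usize {
        self.buffers.iter().map(|b| b.len()).sum()
    }

    /// Assumed heuristic: compact only if buffers hold more than 2x the live bytes.
    fn is_sparse(&self) -> bool {
        self.actual_buffer_size() > 2 * self.ideal_buffer_size()
    }

    /// Copy every live string into one fresh buffer. This walks all rows and
    /// copies all live bytes, which is the extra work a flamegraph would show.
    fn compact(&self) -> Self {
        let values: Vec<&str> = (0..self.views.len()).map(|i| self.value(i)).collect();
        Self::from_strs(&values)
    }
}

fn main() {
    let big = "x".repeat(1000);
    let other = "y".repeat(20);
    let array = ViewArray::from_strs(&[big.as_str(), other.as_str(), "short"]);

    // Drop the 1000-byte row; its bytes stay in the shared buffer.
    let filtered = array.filter(&[false, true, true]);
    println!("sparse after filter: {}", filtered.is_sparse());

    let compacted = filtered.compact();
    println!("buffer bytes after compaction: {}", compacted.actual_buffer_size());
}
```

In this model the check itself scans every view, and compaction copies every live string, so both costs are paid per batch on a path that plain string arrays would not take. Whether that matches what the flamegraphs show for q18 would need to be confirmed against the linked code.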

Here are the explain plans for the query before and after the change

Here are the flamegraphs for the query before and after the change:

  • q18-flamegraph-after
  • q18-flamegraph-before

Labels: bug
