Benchmarks for planning queries #8638

@alamb

Is your feature request related to a problem or challenge?

DataFusion has a variety of benchmarks we use for query execution -- that is how long it takes to run a query.

There is no equivalent benchmark suite for how long it takes to plan a query, which many people have highlighted as an area of DataFusion they would like to improve (see #5637 for various ideas).

Recently we have had some PRs, such as #7942 and #7870, that propose non-trivial planning changes and include micro benchmarks that show good promise. However, we don't have an agreed-upon way to measure the overall impact of such changes.

Describe the solution you'd like

As suggested by @Dandandan in #7942 (comment):

I suggest also adding some benchmarking. We could take, for example, TPC-H and TPC-DS (which we already have in the benchmarks / tests) and benchmark the time it takes to plan/optimize the queries rather than execute them.

Specifically, I propose adding benchmarks (with documentation about why they are included) in

https://github.com/apache/arrow-datafusion/blob/03c2ef46f2d88fb015ee305ab67df6d930b780e2/datafusion/core/benches/sql_planner.rs

The code would basically do

  1. Create the schema
  2. Plan the relevant query (create the physical plan) but not execute it
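The measurement loop for these two steps could be sketched roughly as follows. This is a dependency-free illustration, not the contents of `sql_planner.rs`: `PlanningStats` and `measure_planning` are hypothetical names, and the `plan` closure stands in for whatever builds the physical plan (in DataFusion, presumably `ctx.sql(sql)` followed by `create_physical_plan()` on a context whose schema was registered beforehand). The plan is built and dropped, never executed.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Summary of repeated planning-time measurements (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct PlanningStats {
    pub samples: usize,
    pub min: Duration,
    /// Upper median when the number of samples is even.
    pub median: Duration,
    pub max: Duration,
}

/// Calls `plan` `iterations` times and summarizes how long each call took.
///
/// `plan` is an assumed stand-in for "create the physical plan for one
/// query". Its result goes through `black_box` so the work is not optimized
/// away, and is then dropped without being executed.
/// Returns `None` when `iterations` is zero.
pub fn measure_planning<T>(
    iterations: usize,
    mut plan: impl FnMut() -> T,
) -> Option<PlanningStats> {
    if iterations == 0 {
        return None;
    }

    let mut timings = Vec::with_capacity(iterations);
    for _ in 0..iterations {
        let start = Instant::now();
        let planned = black_box(plan());
        timings.push(start.elapsed());
        drop(planned); // planning only: the plan is never run
    }

    timings.sort();
    Some(PlanningStats {
        samples: timings.len(),
        min: timings[0],
        median: timings[timings.len() / 2],
        max: timings[timings.len() - 1],
    })
}
```

Schema creation belongs outside the closure so that only planning is timed. A statistics-aware harness such as criterion would replace this hand-rolled loop in an actual bench file.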

Describe alternatives you've considered

One alternative could be to update the dfbench tests so they can plan the queries without running them:

It seems it might not be much work adding an option to the benchmark code to only perform the planning rather than executing the queries.

The dfbench code is here: https://github.com/apache/arrow-datafusion/blob/main/benchmarks/src/bin/dfbench.rs
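A minimal sketch of what such an option might look like, using only the standard library. Everything here is hypothetical: `--plan-only`, `RunOptions`, `Outcome`, and `run_query` are illustrative names and are not taken from the dfbench source. The `plan` and `execute` closures stand in for the real planning and execution steps.

```rust
/// Hypothetical run options; `plan_only` is an illustrative field, not an
/// existing dfbench flag.
#[derive(Debug, Default, PartialEq)]
pub struct RunOptions {
    pub plan_only: bool,
}

impl RunOptions {
    /// Sets `plan_only` if any argument equals the assumed `--plan-only` flag.
    pub fn from_args<I, S>(args: I) -> Self
    where
        I: IntoIterator<Item = S>,
        S: AsRef<str>,
    {
        let plan_only = args.into_iter().any(|arg| arg.as_ref() == "--plan-only");
        RunOptions { plan_only }
    }
}

/// What a single benchmark iteration ended up doing.
#[derive(Debug, PartialEq)]
pub enum Outcome<R> {
    /// The query was planned and the plan was discarded.
    Planned,
    /// The query was planned and then executed, producing `R`.
    Executed(R),
}

/// Always plans the query; executes it only when `plan_only` is false.
pub fn run_query<P, R>(
    opts: &RunOptions,
    plan: impl FnOnce() -> P,
    execute: impl FnOnce(P) -> R,
) -> Outcome<R> {
    let planned = plan();
    if opts.plan_only {
        drop(planned); // skip execution entirely
        Outcome::Planned
    } else {
        Outcome::Executed(execute(planned))
    }
}
```

With this shape the existing timing code could wrap `run_query` unchanged, so the same queries would report planning time when the flag is set and end-to-end time otherwise.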

Additional context

No response

Labels

enhancement (New feature or request)
