Add struct pushdown query benchmark and projection pushdown tests #19962
adriangb merged 2 commits into apache:main
Pull request overview
This PR extracts benchmarks and sqllogictest cases from PR #19538 for easier review, focusing on testing struct field access projection pushdown optimization in DataFusion.
Changes:
- Added comprehensive benchmark suite for SQL queries on struct columns in Parquet files with 20 different query patterns
- Added 1000+ line SQLLogicTest file covering projection pushdown behavior with get_field expressions through various operators
- Updated Cargo.toml to register the new benchmark
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| datafusion/core/benches/parquet_struct_query.rs | New benchmark file testing struct field queries on Parquet data with various SQL patterns (filters, joins, aggregations, etc.) |
| datafusion/core/Cargo.toml | Added benchmark entry for parquet_struct_query with parquet feature requirement |
| datafusion/sqllogictest/test_files/projection_pushdown.slt | Comprehensive test suite for get_field projection pushdown through Filter, Sort, TopK, and multi-partition scenarios |
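To make concrete what pushing a struct field access into the scan can buy, here is a minimal, self-contained sketch. It is a toy model and not DataFusion code: the `Expr` enum and the `leaf_path` and `root_column` helpers are hypothetical names invented for illustration. It compares the columns a scan would be asked for when it only knows about whole columns with the leaf paths it could read when the field access itself reaches the scan.

```rust
use std::collections::BTreeSet;

/// Toy expression (hypothetical, not DataFusion's `Expr`):
/// a column reference or a field access on a struct-valued expression.
#[derive(Debug, Clone)]
enum Expr {
    Column(String),
    GetField(Box<Expr>, String),
}

/// Dotted leaf path the expression needs, e.g. `s.value` for get_field(s, 'value').
fn leaf_path(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => name.clone(),
        Expr::GetField(base, field) => format!("{}.{}", leaf_path(base), field),
    }
}

/// Top-level column the expression is rooted at.
fn root_column(expr: &Expr) -> &str {
    match expr {
        Expr::Column(name) => name,
        Expr::GetField(base, _) => root_column(base),
    }
}

fn main() {
    // Models: SELECT id, s['value'] FROM simple_struct
    let projection = vec![
        Expr::Column("id".to_string()),
        Expr::GetField(Box::new(Expr::Column("s".to_string())), "value".to_string()),
    ];

    // Scan that only understands whole columns: it must read all of `s`.
    let whole: BTreeSet<&str> = projection.iter().map(root_column).collect();
    // Scan that receives the field access: it can target the `s.value` leaf only.
    let leaves: BTreeSet<String> = projection.iter().map(leaf_path).collect();

    println!("whole columns: {:?}", whole);
    println!("leaf paths:    {:?}", leaves);
}
```

Whether a real data source actually skips the unused struct children depends on the file format and reader; the sketch only shows the information that becomes available to the scan.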
Extract benchmarks and sqllogictest cases from apache#19538 for easier review. Includes a new benchmark for SQL queries on struct columns in Parquet files, covering struct access, filtering, joins, and aggregations with 524K rows and 8 row groups. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Force-pushed from 414b451 to 30b5888.
logical_plan
01)Projection: simple_struct.id, get_field(simple_struct.s, Utf8("value"))
02)--TableScan: simple_struct projection=[id, s]
physical_plan DataSourceExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/projection_pushdown/simple.parquet]]}, projection=[id, get_field(s@1, value) as simple_struct.s[value]], file_type=parquet
It is interesting that these expressions have already been pushed down to the datasource
Yep, in some cases (no sort, no repartition, etc.) it already works, but only because the entire projection gets pushed down.
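The distinction in this reply can be sketched with a toy plan. Everything below is hypothetical and deliberately simplified: the `Plan` enum and the `push_adjacent` rule are invented for illustration and may not match DataFusion's actual optimizer rules. The rule folds a projection into the scan only when the scan is its direct child, so the field access reaches the scan in the simple case but stays above a `Sort` that sits in between.

```rust
/// Toy logical plan (hypothetical, not DataFusion's `LogicalPlan`).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    /// Leaf scan holding the expressions it evaluates itself.
    Scan { projection: Vec<String> },
    Sort { input: Box<Plan> },
    Projection { exprs: Vec<String>, input: Box<Plan> },
}

/// Naive rule: fold a projection into the scan only when the scan is its
/// direct child. Any other node in between blocks the pushdown.
fn push_adjacent(plan: Plan) -> Plan {
    match plan {
        Plan::Projection { exprs, input } => match *input {
            Plan::Scan { .. } => Plan::Scan { projection: exprs },
            other => Plan::Projection {
                exprs,
                input: Box::new(push_adjacent(other)),
            },
        },
        Plan::Sort { input } => Plan::Sort {
            input: Box::new(push_adjacent(*input)),
        },
        scan => scan,
    }
}

fn main() {
    let get_field = "get_field(s, value)".to_string();
    let scan = Plan::Scan {
        projection: vec!["id".to_string(), "s".to_string()],
    };

    // Projection directly over the scan: the field access lands in the scan.
    let direct = Plan::Projection {
        exprs: vec!["id".to_string(), get_field.clone()],
        input: Box::new(scan.clone()),
    };
    println!("{:?}", push_adjacent(direct));

    // Sort in between: the naive rule leaves the field access above the Sort,
    // and the scan still reads the whole struct column `s`.
    let with_sort = Plan::Projection {
        exprs: vec!["id".to_string(), get_field],
        input: Box::new(Plan::Sort { input: Box::new(scan) }),
    };
    println!("{:?}", push_adjacent(with_sort));
}
```

A rule that pushes the field access through intermediate operators would handle the second case too; the toy rule does not attempt that.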
@@ -241,6 +241,11 @@ harness = false
name = "parquet_query_sql"
required-features = ["parquet"]
Is there any reason not to just add the benchmarks to parquet_query_sql?
I could, but it's kind of nice to be able to run them in isolation easily, at least for now while we're developing just these. And in some sense the feature we're working on needn't be Parquet-specific (e.g. Vortex). We can always fold them in later.
Thanks @alamb!
Summary
Extract benchmarks and sqllogictest cases from #19538 for easier review.
This PR includes:
New Benchmark:
- parquet_struct_query.rs - Benchmarks SQL queries on struct columns in Parquet files
  - id (Int32) and s (Struct with id/Int32 and value/Utf8 fields)
SQLLogicTest:
- projection_pushdown.slt - Tests for projection pushdown optimization
Changes
- datafusion/core/benches/parquet_struct_query.rs
- datafusion/core/Cargo.toml with benchmark entry
- datafusion/sqllogictest/test_files/projection_pushdown.slt
Test Plan
- cargo bench --profile dev --bench parquet_struct_query
🤖 Generated with Claude Code