Open
Conversation
Add `should_push_through_operator` method to `ProjectionExprs` that checks whether a projection provides actual benefit when pushed through operators like Filter, Sort, Repartition, etc. A projection should be pushed through when it is: 1. Trivial (no expensive computations to duplicate) 2. AND provides benefit via one of: - Narrowing the schema (fewer output columns than input) - Having field accessors that reduce data size - Having literals that can be absorbed by the datasource Column-only projections that just rename without narrowing the schema are NOT pushed through, as they provide no benefit. This fixes test failures where column-only renaming projections were incorrectly being pushed through filters and other operators. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR implements projection pushdown for nested expressions, specifically enabling get_field operations on struct/map columns to be pushed down through various operators (filters, repartitions, sorts) to the data source level. This optimization reduces data processing by allowing field extraction to happen earlier in the query plan.
Changes:
- Introduces a new
ArgTrivialityclassification system to distinguish between trivial expressions (columns, literals, field accessors) and non-trivial expressions (computations) - Implements projection splitting to extract beneficial sub-expressions from mixed projections
- Updates pushdown logic across operators (RepartitionExec, FilterExec, SortExec, etc.) to allow trivial expressions through
- Enhances
update_exprwith two-pass rewriting to handle expression matching and column remapping
Reviewed changes
Copilot reviewed 28 out of 30 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
datafusion/expr-common/src/triviality.rs |
New file defining ArgTriviality enum for expression classification |
datafusion/functions/src/core/getfield.rs |
Implements triviality logic for get_field function |
datafusion/physical-optimizer/src/projection_pushdown.rs |
Adds projection splitting logic to extract beneficial sub-expressions |
datafusion/physical-plan/src/projection.rs |
Adds is_trivial_or_narrows_schema helper and updates projection merging logic |
datafusion/physical-plan/src/filter.rs |
Updates filter pushdown to use new triviality-based checks |
datafusion/physical-plan/src/repartition/mod.rs |
Removes restrictive checks to allow trivial expression pushdown |
datafusion/physical-plan/src/sorts/*.rs |
Updates sort operators to allow pushdown without schema narrowing requirement |
datafusion/physical-expr/src/projection.rs |
Implements two-pass expression rewriting in update_expr |
datafusion/core/tests/physical_optimizer/projection_pushdown.rs |
Adds comprehensive test for nested field accessor pushdown |
*.slt files |
Updates expected query plans to reflect new pushdown behavior |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Allow TrivialExpr as a valid base for get_field triviality check, enabling nested field access like get_field(get_field(col, 'a'), 'b') to be considered trivial. Literal base is explicitly not considered trivial since it would be constant-folded anyway. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Fix typo "expressions" -> "expression" in triviality.rs - Add missing period in coalesce_partitions.rs comment - Fix typo "expressions" -> "expression" in projection_pushdown.rs - Clarify comment in projection.rs about beneficial expressions Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
When SortExec or SortPreservingMergeExec has a fetch (TopK behavior), they act as filters reducing rows. Pushing non-trivial projections (like literals) through them causes the expression to be evaluated on all input rows instead of just the filtered output rows. Added is_trivial_or_narrows_schema check for sort operators with fetch to prevent pushing literals and computations below TopK operators. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add comprehensive sqllogictest coverage for projection pushdown optimization with struct field access (get_field) expressions. Test coverage includes: - Basic get_field pushdown into DataSourceExec - Nested struct access (s['outer']['inner']) - Projection through Filter, Sort, and TopK operators - Multi-partition scenarios with SortPreservingMergeExec - Edge cases: nullable structs, common subexpressions, literals Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.