Fix PruningPredicate interaction with DynamicFilterPhysicalExpr that references partition columns by adriangb · Pull Request #19129 · apache/datafusion

adriangb · 2025-12-06T13:08:53Z

Fix handling of DynamicFilterPhysicalExpr that references partition columns
Adds some integration tests for handling of literal expression trees, making sure that if they are passed through PhysicalExprSimplifier before PruningPredicate we are able to prune.
Refactors internal tracking of column counts to short circuit early and make match logic easier to follow

adriangb · 2025-12-06T15:36:41Z

Todo: add test that is always true. Add test for nested / complex literal trees.

adriangb · 2025-12-06T17:17:20Z

datafusion/datasource-parquet/src/opener.rs

                enable_page_index: false,
                enable_bloom_filter: false,
-                enable_row_group_stats_pruning: true,
+                enable_row_group_stats_pruning: false, // note that this is false!


Otherwise the test failed because the predicate would successfully prune based on stats

…19130) This improves handling of constant expressions during pruning by trying to evaluate them in the simplifier and the pruning machinery. This is somewhat redundant with #19129 in the simple case of our Parquet implementation but since there may be edge cases where one is hit and not the other, or where users are using them independently I thought it best to implement both approaches.

adriangb · 2025-12-08T13:32:14Z

datafusion/pruning/src/pruning_predicate.rs

+/// Count of distinct column references in an expression.
+/// This is the same as [`collect_columns`] but optimized to stop counting
+/// once more than one distinct column is found.
+///
+/// For example, in expression `col1 + col2`, the count is `Many`.
+/// In expression `col1 + 5`, the count is `One`.
+/// In expression `5 + 10`, the count is `Zero`.
+#[derive(Debug, PartialEq, Eq)]
+enum ColumnReferenceCount {


This replaces collect_columns because:

We only ever want to know whether there are zero, one, or more than one distinct columns. This short-circuits and avoids extra work if we're going to bail anyway.

Makes the match statements clearer instead of matching on .len() integers.

Avoids columns.iter().first().unwrap() later on (even though this does still contain an unwrap internally)
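A rough illustration of the short-circuit idea, not the code in this PR: the `Expr` tree below is a hypothetical stand-in for `PhysicalExpr`, and having `One` carry the column name is an assumption made for the sketch.

```rust
use std::collections::HashSet;

/// Toy expression tree (hypothetical; stands in for `PhysicalExpr`).
#[derive(Debug, Clone)]
pub enum Expr {
    Column(String),
    Literal(i64),
    Binary(Box<Expr>, Box<Expr>),
}

/// The three cases from the doc comment. Carrying the name in `One`
/// is an assumption of this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ColumnReferenceCount {
    Zero,
    One(String),
    Many,
}

/// Returns `false` as soon as a second distinct column is seen,
/// which stops the rest of the walk.
fn visit<'a>(expr: &'a Expr, seen: &mut HashSet<&'a str>) -> bool {
    match expr {
        Expr::Column(name) => {
            seen.insert(name.as_str());
            seen.len() <= 1
        }
        Expr::Literal(_) => true,
        // `&&` short-circuits, so the right side is skipped once we bail.
        Expr::Binary(l, r) => visit(l, seen) && visit(r, seen),
    }
}

pub fn count_columns(expr: &Expr) -> ColumnReferenceCount {
    let mut seen = HashSet::new();
    if !visit(expr, &mut seen) {
        return ColumnReferenceCount::Many;
    }
    match seen.into_iter().next() {
        Some(name) => ColumnReferenceCount::One(name.to_string()),
        None => ColumnReferenceCount::Zero,
    }
}
```

Callers can then match on `Zero`, `One(..)`, and `Many` directly instead of comparing `.len()` against integers.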

alamb

Thanks @adriangb

This looks interesting -- the only thing I don't understand is why the previously added constant folding expression doesn't cover this

alamb · 2025-12-08T13:29:23Z

datafusion/pruning/src/pruning_predicate.rs

+
+    // Test that always-true literal predicates don't prune any containers
+    #[test]
+    fn row_group_predicate_literal_true() {


Can we please also add a test for literal (boolean) null?

Added row_group_predicate_literal_null

alamb · 2025-12-08T13:30:27Z

datafusion/pruning/src/pruning_predicate.rs

+        prune_with_expr(lit(true).or(lit(false)), &schema, &statistics, &[true]);
+
+        // Complex nested: (1 < 2) AND (3 > 1) = true AND true = true
+        prune_with_expr(


Can you also please add an error test for pruning with a non-boolean literal (e.g. lit(1i32)) and just verify that it errors reasonably (rather than giving the wrong answer)?

alamb · 2025-12-08T13:33:00Z

datafusion/pruning/src/pruning_predicate.rs

+
+    // Handle literal-to-literal comparisons (no columns on either side)
+    // e.g., lit(1) = lit(2) should evaluate to false and prune all containers
+    if left_columns.is_empty() && right_columns.is_empty() {


This seems redundant with the constant folding introduced in #19130

Why do we need both? Maybe we just need to constant fold the expression after applying the physical expr adapter 🤔

adriangb · 2025-12-08T14:14:36Z

This looks interesting -- the only thing I don't understand is why the previously added constant folding expression doesn't cover this

It does overlap with that work. My reasoning for doing it in two places was that these are two disconnected APIs (i.e. we don't require you to run the simplifier before calling PruningPredicate::try_new) and there may be teams using one without the other. The alternative to this would be to recommend calling the simplifier before calling PruningPredicate and not support this in PruningPredicate? If so I can refactor to recommend that and do so in the tests (i.e. verify that the integration works without implementing the feature here as well).

adriangb · 2025-12-08T14:28:20Z

@alamb I've removed the handling of literals and instead added documentation and integration tests.

So this PR is now tests + refactoring to short circuit collect_columns.

adriangb · 2025-12-08T16:48:48Z

Okay I did find the one case that this covers: select * from t order by part_col, col limit 10.

This will generate a dynamic filter that references part_col, but since it's buried in a dynamic filter the simplifier won't simplify it. I was able to work around that: 2f591b8

adriangb · 2025-12-08T17:20:58Z

Okay I did find the one case that this covers: select * from t order by part_col, col limit 10.

This will generate a dynamic filter that references part_col, but since it's buried in a dynamic filter the simplifier won't simplify it. I was able to work around that: 2f591b8

I've copied that change over to here, it seems more appropriate and fits in with the original goal of this PR
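To make the failure mode concrete, here is a minimal sketch of what happens once a partition value has been substituted into a snapshotted filter. The `Expr` type and both helper functions are hypothetical stand-ins, not DataFusion APIs; the real simplifier covers far more cases.

```rust
/// Toy expression tree (hypothetical; stands in for `PhysicalExpr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Int(i64),
    Bool(bool),
    Lt(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Substitute a known partition value for every reference to `name`.
pub fn replace_column(expr: Expr, name: &str, value: i64) -> Expr {
    match expr {
        Expr::Column(c) if c == name => Expr::Int(value),
        Expr::Lt(l, r) => Expr::Lt(
            Box::new(replace_column(*l, name, value)),
            Box::new(replace_column(*r, name, value)),
        ),
        Expr::And(l, r) => Expr::And(
            Box::new(replace_column(*l, name, value)),
            Box::new(replace_column(*r, name, value)),
        ),
        other => other,
    }
}

/// Bottom-up constant folding for the two operators in this toy tree.
pub fn simplify(expr: Expr) -> Expr {
    match expr {
        Expr::Lt(l, r) => match (simplify(*l), simplify(*r)) {
            (Expr::Int(a), Expr::Int(b)) => Expr::Bool(a < b),
            (l, r) => Expr::Lt(Box::new(l), Box::new(r)),
        },
        Expr::And(l, r) => match (simplify(*l), simplify(*r)) {
            // `false AND x` is `false` regardless of `x` (no NULLs in this toy).
            (Expr::Bool(false), _) | (_, Expr::Bool(false)) => Expr::Bool(false),
            (Expr::Bool(true), other) | (other, Expr::Bool(true)) => other,
            (l, r) => Expr::And(Box::new(l), Box::new(r)),
        },
        other => other,
    }
}
```

With `part_col < 5 AND col < 10` and a partition value of 8, replacement yields `8 < 5 AND col < 10`, and the folding pass reduces that to `false`, so pruning never has to reason about a literal-to-literal comparison.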

adriangb · 2025-12-08T17:22:02Z

datafusion/physical-expr-common/src/physical_expr.rs

+///
+/// Returns a [`Transformed`] indicating whether a snapshot was taken,
+/// along with the resulting `PhysicalExpr`.
+pub fn snapshot_physical_expr_opt(


The idea here is that instead of doing 1 traversal to determine if it's a dynamic expression and another to snapshot we can do a single traversal. This also handles the case where an arbitrary PhysicalExpr implements snapshotting that is not a dynamic filter.
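A simplified sketch of the single-traversal shape, assuming a toy tree in which a `Dynamic` node wraps its current value. The `Expr` type, the local `Transformed` struct, and `snapshot_opt` are illustrative names only; the real function operates on `Arc<dyn PhysicalExpr>` and DataFusion's own `Transformed` type.

```rust
/// Toy expression tree (hypothetical; stands in for `PhysicalExpr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(String),
    Literal(i64),
    /// A node whose current value can be captured ("snapshotted").
    Dynamic(Box<Expr>),
    Binary(Box<Expr>, Box<Expr>),
}

/// Minimal local stand-in for a "result plus changed flag" wrapper.
#[derive(Debug, PartialEq)]
pub struct Transformed {
    pub data: Expr,
    pub transformed: bool,
}

/// One walk both replaces snapshot-capable nodes and reports whether
/// any replacement happened, so no separate "is it dynamic?" pass is needed.
pub fn snapshot_opt(expr: Expr) -> Transformed {
    match expr {
        Expr::Dynamic(inner) => {
            // The captured value may itself contain dynamic nodes.
            let inner = snapshot_opt(*inner);
            Transformed { data: inner.data, transformed: true }
        }
        Expr::Binary(l, r) => {
            let l = snapshot_opt(*l);
            let r = snapshot_opt(*r);
            Transformed {
                data: Expr::Binary(Box::new(l.data), Box::new(r.data)),
                transformed: l.transformed || r.transformed,
            }
        }
        leaf => Transformed { data: leaf, transformed: false },
    }
}
```

A caller that only wants to pay for follow-up work (such as a simplifier pass) when something changed can check the `transformed` flag first.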

adriangb · 2025-12-08T17:22:51Z

datafusion/pruning/src/pruning_predicate.rs

+            .with("c1", ContainerStats::new_i32(vec![Some(0)], vec![Some(10)]));
+        let expected_ret = &[true];
+        prune_with_expr(lit(1), &schema, &statistics, expected_ret);
+    }


@alamb this is the other test you asked for

alamb · 2025-12-08T21:29:53Z

datafusion/pruning/src/pruning_predicate.rs

-        // which does not handle dynamic exprs in general
-        let expr = snapshot_physical_expr(expr)?;
+    ///
+    /// Note that `PruningPredicate` does not attempt to normalize or simplify


It seems to me like PruningPredicate does now actually call simplify 🤔 (if it is a snapshot)

Will update

alamb · 2025-12-08T21:30:43Z

datafusion/pruning/src/pruning_predicate.rs

+            // children after snapshotting and previously `replace_columns_with_literals` may have been called with partition values
+            // the expression we have now is `8 < 5 and col < 10`.
+            // Thus we need a simplifier pass to get `false and col < 10` => `false` here.
+            let simplifier = PhysicalExprSimplifier::new(&schema);


Since this code is specific to dynamic expressions, maybe the call to simplify would make more sense in the snapshot_physical_expr_opt method itself?

Hmm interesting. I think maybe best to keep things as is. E.g. if you're going to evaluate the expression against data (as opposed to doing the kind of weird stuff PruningPredicate does) then maybe you don't want to pay the simplify cost?

alamb · 2025-12-08T21:31:52Z

datafusion/pruning/src/pruning_predicate.rs

+        expr.apply(|expr| {
+            if let Some(column) = expr.as_any().downcast_ref::<phys_expr::Column>() {
+                seen.insert(column.clone());
+                if seen.len() > 1 {


I am surprised clippy didn't complain about this not using is_empty 🤔

I think that's because `len() > 1` is not the same check as `len() >= 1`, so `is_empty` doesn't apply here.

github-actions bot added the datasource Changes to the datasource crate label Dec 6, 2025

This was referenced Dec 6, 2025

Add constant expression evaluator to physical expression simplifier #19130

Merged

Move partition handling out of PhysicalExprAdapter #19128

Merged

adriangb mentioned this pull request Dec 7, 2025

Support Push down expression evaluation in TableProviders #14993

Closed

adriangb force-pushed the prune-literals branch from eebccef to b1c49b4 Compare December 7, 2025 14:44

github-actions bot removed the datasource Changes to the datasource crate label Dec 7, 2025

alamb mentioned this pull request Dec 8, 2025

Andrew Lamb Weekly-ish Open Source plan - 2025-12-08 #19210

Closed

40 tasks

adriangb force-pushed the prune-literals branch from b1c49b4 to 7d3efc8 Compare December 8, 2025 13:19

github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 8, 2025

alamb reviewed Dec 8, 2025

adriangb requested a review from alamb December 8, 2025 15:10

adriangb force-pushed the prune-literals branch from a60b943 to 49ebece Compare December 8, 2025 15:27

github-actions bot removed the physical-expr Changes to the physical-expr crates label Dec 8, 2025

adriangb changed the title ~~Support literal-only predicates in PruningPredicate~~ Fix PruningPredicate interaction with DynamicFilterPhysicalExpr that references partition columns Dec 8, 2025

adriangb added 6 commits December 8, 2025 12:36

Support literal-only predicates in PruningPredicate

34ab83b

fix test

98576f4

fmt

2eadf7c

tweak

6362f00

Add tests for always-true and complex literal predicates

8b5a73b

typo

db1670b

adriangb added 5 commits December 8, 2025 12:36

fmt

a72d95a

short circuit column collection

8d863fa

remove duplicate implementation

0ce692e

add handling of dynamic filters with replaced children

a43e15f

lint

ecad024

adriangb force-pushed the prune-literals branch from f6b92f8 to ecad024 Compare December 8, 2025 18:37

github-actions bot added the physical-expr Changes to the physical-expr crates label Dec 8, 2025

lint

d67d43d

alamb approved these changes Dec 8, 2025

update comment

0cd3f55

adriangb added this pull request to the merge queue Dec 9, 2025

Merged via the queue into apache:main with commit 83736ef Dec 9, 2025
31 checks passed

adriangb deleted the prune-literals branch December 9, 2025 03:00
