Support inferring new predicates to push down #15906

xudong963 · 2025-04-30T15:35:07Z

Which issue does this PR close?

Closes #.

Rationale for this change

We can infer new predicates from existing ones and push them down, which reduces IO and can improve performance significantly.

What changes are included in this PR?

Infers new predicates by substituting equalities.
For example, with predicates t2.b = 3 and t1.b > t2.b, we can infer t1.b > 3.
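
As an illustration of the substitution idea only, here is a minimal, self-contained sketch that does not use DataFusion's `Expr` type. The `Operand`, `Op`, `Pred`, `col`, and `infer_from_equalities` names are hypothetical and are not the PR's API. The sketch assumes predicates are simple binary comparisons and only handles `column = literal` bindings.

```rust
use std::collections::HashMap;

/// A toy operand: either a column name or an integer literal (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum Operand {
    Col(String),
    Lit(i64),
}

/// Comparison operators supported by this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    Gt,
    Lt,
}

/// A toy binary predicate `left op right`.
#[derive(Debug, Clone, PartialEq)]
struct Pred {
    left: Operand,
    op: Op,
    right: Operand,
}

/// Convenience constructor for a column operand.
fn col(name: &str) -> Operand {
    Operand::Col(name.to_string())
}

/// Returns the predicates that can be inferred by replacing a column pinned
/// by `column = literal` with that literal in column-to-column comparisons.
fn infer_from_equalities(preds: &[Pred]) -> Vec<Pred> {
    // Step 1: collect `column -> literal` bindings from equality predicates.
    let mut bindings: HashMap<String, i64> = HashMap::new();
    for p in preds {
        if p.op != Op::Eq {
            continue;
        }
        match (&p.left, &p.right) {
            (Operand::Col(c), Operand::Lit(v)) | (Operand::Lit(v), Operand::Col(c)) => {
                bindings.insert(c.clone(), *v);
            }
            _ => {}
        }
    }

    // Looks up the literal bound to a column operand, if any.
    let subst = |o: &Operand| match o {
        Operand::Col(c) => bindings.get(c).map(|v| Operand::Lit(*v)),
        Operand::Lit(_) => None,
    };

    // Step 2: substitute into predicates that compare two columns.
    let mut inferred = Vec::new();
    for p in preds {
        if let (Operand::Col(_), Operand::Col(_)) = (&p.left, &p.right) {
            match (subst(&p.left), subst(&p.right)) {
                (None, Some(r)) => inferred.push(Pred { left: p.left.clone(), op: p.op, right: r }),
                (Some(l), None) => inferred.push(Pred { left: l, op: p.op, right: p.right.clone() }),
                // Both or neither side bound: nothing useful to push down.
                _ => {}
            }
        }
    }
    inferred
}
```

Given `t2.b = 3` and `t1.b > t2.b`, this sketch yields the single predicate `t1.b > 3`, which references only `t1` and is therefore a candidate for pushing below the join.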

See sqllogictest for another case.

Are these changes tested?

Yes

Are there any user-facing changes?

xudong963 · 2025-04-30T15:35:47Z

datafusion/sqllogictest/test_files/push_down_filter.slt

+logical_plan
+01)Inner Join: t1.a = t2.a
+02)--Projection: t1.a, t1.b
+03)----Filter: __common_expr_4 >= Int64(3) AND __common_expr_4 <= Int64(5)


This is the inferred predicate, which can be pushed down to the t1 scan.

xudong963 · 2025-04-30T15:36:56Z

datafusion/optimizer/src/push_down_filter.rs

+/// Infers new predicates by substituting equalities.
+/// For example, with predicates `t2.b = 3` and `t1.b > t2.b`,
+/// we can infer `t1.b > 3`.
+fn infer_predicates_from_equalities(predicates: Vec<Expr>) -> Result<Vec<Expr>> {


In the future, we can move this code into a dedicated optimizer rule, such as InferPredicates.

I think this might be a special case of the range analysis code in

https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html

In other words, instead of this special case maybe we could use the cp_solver to create a more general framework for introducing inferred predicates 🤔

Now that we have predicate pushdown for ExecutionPlans, maybe it is more realistic to do this.

I'll check the cp_solver (I hadn't noticed that part of the code before).

Great suggestion. I can help if you need some direction or run into anything confusing.
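
To make the range-analysis framing above concrete, here is a toy sketch of interval propagation for a single `x > y` constraint over integers. It is not the `cp_solver` API; the `Interval` type and `propagate_gt` function are hypothetical and only illustrate how the equality-substitution case falls out of interval tightening.

```rust
/// A closed integer interval `[lo, hi]` (hypothetical type for illustration).
/// `i64::MIN` and `i64::MAX` stand in for "unbounded".
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lo: i64,
    hi: i64,
}

impl Interval {
    const UNBOUNDED: Interval = Interval { lo: i64::MIN, hi: i64::MAX };

    /// Intersection of two intervals, or `None` if they do not overlap.
    fn intersect(self, other: Interval) -> Option<Interval> {
        let lo = self.lo.max(other.lo);
        let hi = self.hi.min(other.hi);
        if lo <= hi {
            Some(Interval { lo, hi })
        } else {
            None
        }
    }
}

/// Tightens the ranges of `x` and `y` under the constraint `x > y`.
/// Returns `None` when the constraint cannot be satisfied.
fn propagate_gt(x: Interval, y: Interval) -> Option<(Interval, Interval)> {
    // x must be at least y.lo + 1; overflow means no such x exists.
    let x_min = y.lo.checked_add(1)?;
    // y must be at most x.hi - 1; overflow means no such y exists.
    let y_max = x.hi.checked_sub(1)?;
    let new_x = x.intersect(Interval { lo: x_min, hi: i64::MAX })?;
    let new_y = y.intersect(Interval { lo: i64::MIN, hi: y_max })?;
    Some((new_x, new_y))
}
```

With `t2.b` fixed to `[3, 3]` by `t2.b = 3` and `t1.b` initially unbounded, `propagate_gt` narrows `t1.b` to `[4, i64::MAX]`, which for integers is the same fact as the inferred predicate `t1.b > 3`.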

xudong963 · 2025-04-30T15:43:16Z

datafusion/expr/src/expr_rewriter/mod.rs

@@ -131,13 +131,25 @@ pub fn normalize_sorts(
 }

 /// Recursively replace all [`Column`] expressions in a given expression tree with
-/// `Column` expressions provided by the hash map argument.
-pub fn replace_col(expr: Expr, replace_map: &HashMap<&Column, &Column>) -> Result<Expr> {


I didn't want to write a near-duplicate method for this PR, so I made the existing method generic.

pub fn replace_col_with_expr(
    expr: Expr,
    replace_map: &HashMap<Column, &Expr>,
) -> Result<Expr> {
    expr.transform(|expr| {
        Ok({
            if let Expr::Column(c) = &expr {
                match replace_map.get(c) {
                    Some(new_expr) => Transformed::yes((**new_expr).to_owned()),
                    None => Transformed::no(expr),
                }
            } else {
                Transformed::no(expr)
            }
        })
    })
    .data()
}

Omega359 · 2025-04-30T18:14:17Z

Perhaps running clickbench or equivalent (assuming clickbench wouldn't trigger this optimization) to showcase the difference would be good?

alamb · 2025-05-05T19:19:48Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing infer_filter (d0aac59) to af99b54 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-05-05T20:02:27Z

🤖: Benchmark completed

Details

Comparing HEAD and infer_filter
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ infer_filter ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1918.57ms │    1907.49ms │     no change │
│ QQuery 1     │   676.14ms │     741.17ms │  1.10x slower │
│ QQuery 2     │  1387.03ms │    1520.49ms │  1.10x slower │
│ QQuery 3     │   720.59ms │     712.25ms │     no change │
│ QQuery 4     │  1490.52ms │    1472.76ms │     no change │
│ QQuery 5     │ 15473.75ms │   15353.95ms │     no change │
│ QQuery 6     │  2069.18ms │    2073.35ms │     no change │
│ QQuery 7     │  2875.14ms │    2642.26ms │ +1.09x faster │
└──────────────┴────────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)           │ 26610.92ms │
│ Total Time (infer_filter)   │ 26423.71ms │
│ Average Time (HEAD)         │  3326.36ms │
│ Average Time (infer_filter) │  3302.96ms │
│ Queries Faster              │          1 │
│ Queries Slower              │          2 │
│ Queries with No Change      │          5 │
└─────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ infer_filter ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.35ms │       2.33ms │     no change │
│ QQuery 1     │    39.48ms │      39.13ms │     no change │
│ QQuery 2     │    92.77ms │      90.60ms │     no change │
│ QQuery 3     │    99.39ms │     100.59ms │     no change │
│ QQuery 4     │   752.62ms │     770.88ms │     no change │
│ QQuery 5     │   850.04ms │     847.61ms │     no change │
│ QQuery 6     │     2.22ms │       2.25ms │     no change │
│ QQuery 7     │    43.03ms │      45.67ms │  1.06x slower │
│ QQuery 8     │   906.82ms │     907.40ms │     no change │
│ QQuery 9     │  1188.74ms │    1218.91ms │     no change │
│ QQuery 10    │   262.08ms │     276.10ms │  1.05x slower │
│ QQuery 11    │   298.84ms │     312.09ms │     no change │
│ QQuery 12    │   915.60ms │     932.42ms │     no change │
│ QQuery 13    │  1357.23ms │    1374.77ms │     no change │
│ QQuery 14    │   843.58ms │     876.74ms │     no change │
│ QQuery 15    │  1035.11ms │    1047.61ms │     no change │
│ QQuery 16    │  1752.19ms │    1755.11ms │     no change │
│ QQuery 17    │  1623.80ms │    1626.60ms │     no change │
│ QQuery 18    │  3116.11ms │    3116.02ms │     no change │
│ QQuery 19    │    86.93ms │      84.66ms │     no change │
│ QQuery 20    │  1146.58ms │    1148.49ms │     no change │
│ QQuery 21    │  1351.59ms │    1322.77ms │     no change │
│ QQuery 22    │  2225.44ms │    2207.98ms │     no change │
│ QQuery 23    │  8476.41ms │    8512.21ms │     no change │
│ QQuery 24    │   466.72ms │     482.64ms │     no change │
│ QQuery 25    │   392.36ms │     393.81ms │     no change │
│ QQuery 26    │   536.71ms │     546.16ms │     no change │
│ QQuery 27    │  1681.25ms │    1721.62ms │     no change │
│ QQuery 28    │ 12921.17ms │   12629.70ms │     no change │
│ QQuery 29    │   537.41ms │     531.43ms │     no change │
│ QQuery 30    │   823.54ms │     825.28ms │     no change │
│ QQuery 31    │   862.88ms │     878.91ms │     no change │
│ QQuery 32    │  2661.64ms │    2696.28ms │     no change │
│ QQuery 33    │  3403.66ms │    3404.14ms │     no change │
│ QQuery 34    │  3402.45ms │    3432.14ms │     no change │
│ QQuery 35    │  1294.58ms │    1293.32ms │     no change │
│ QQuery 36    │   124.72ms │     126.12ms │     no change │
│ QQuery 37    │    62.15ms │      57.33ms │ +1.08x faster │
│ QQuery 38    │   127.61ms │     125.38ms │     no change │
│ QQuery 39    │   201.93ms │     208.89ms │     no change │
│ QQuery 40    │    49.64ms │      49.64ms │     no change │
│ QQuery 41    │    46.39ms │      47.07ms │     no change │
│ QQuery 42    │    40.03ms │      39.34ms │     no change │
└──────────────┴────────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)           │ 58105.81ms │
│ Total Time (infer_filter)   │ 58108.12ms │
│ Average Time (HEAD)         │  1351.30ms │
│ Average Time (infer_filter) │  1351.35ms │
│ Queries Faster              │          1 │
│ Queries Slower              │          2 │
│ Queries with No Change      │         40 │
└─────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ infer_filter ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 122.06ms │     121.81ms │     no change │
│ QQuery 2     │  23.31ms │      23.84ms │     no change │
│ QQuery 3     │  34.95ms │      34.94ms │     no change │
│ QQuery 4     │  20.60ms │      21.60ms │     no change │
│ QQuery 5     │  54.23ms │      55.78ms │     no change │
│ QQuery 6     │  11.93ms │      12.03ms │     no change │
│ QQuery 7     │ 103.93ms │     104.94ms │     no change │
│ QQuery 8     │  25.87ms │      27.01ms │     no change │
│ QQuery 9     │  62.76ms │      62.03ms │     no change │
│ QQuery 10    │  57.75ms │      58.35ms │     no change │
│ QQuery 11    │  13.27ms │      12.92ms │     no change │
│ QQuery 12    │  44.48ms │      46.09ms │     no change │
│ QQuery 13    │  28.41ms │      29.99ms │  1.06x slower │
│ QQuery 14    │   9.98ms │       9.98ms │     no change │
│ QQuery 15    │  24.83ms │      25.85ms │     no change │
│ QQuery 16    │  22.97ms │      25.65ms │  1.12x slower │
│ QQuery 17    │  97.96ms │      96.66ms │     no change │
│ QQuery 18    │ 243.04ms │     249.09ms │     no change │
│ QQuery 19    │  28.85ms │      29.68ms │     no change │
│ QQuery 20    │  40.67ms │      39.36ms │     no change │
│ QQuery 21    │ 171.04ms │     171.72ms │     no change │
│ QQuery 22    │  18.48ms │      17.33ms │ +1.07x faster │
└──────────────┴──────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)           │ 1261.37ms │
│ Total Time (infer_filter)   │ 1276.65ms │
│ Average Time (HEAD)         │   57.33ms │
│ Average Time (infer_filter) │   58.03ms │
│ Queries Faster              │         1 │
│ Queries Slower              │         2 │
│ Queries with No Change      │        19 │
└─────────────────────────────┴───────────┘

alamb · 2025-05-07T20:46:02Z

Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look

Support inferring new predicates to push down

d0aac59

github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 30, 2025

xudong963 requested a review from alamb April 30, 2025 15:37

xudong963 assigned jayzhan211 and unassigned jayzhan211 Apr 30, 2025

xudong963 requested a review from jayzhan211 April 30, 2025 15:37

alamb marked this pull request as draft May 7, 2025 20:45