Support inferring new predicates to push down #15906

Draft
wants to merge 1 commit into
base: main

Conversation

xudong963
Member

Which issue does this PR close?

  • Closes #.

Rationale for this change

We can infer new predicates from existing ones and push them down, which can reduce IO and improve performance.

What changes are included in this PR?

Infers new predicates by substituting equalities.
For example, with predicates t2.b = 3 and t1.b > t2.b, we can infer t1.b > 3.

See sqllogictest for another case.
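
The substitution idea can be sketched without DataFusion at all. The snippet below is a minimal, self-contained illustration, not the code in this PR: the `Op`, `Operand`, and `Pred` types and the `infer_from_equalities` function are hypothetical stand-ins, and the predicates are assumed to be conjuncts of a single filter. It collects `column = literal` bindings and substitutes them into the remaining predicates.

```rust
use std::collections::HashMap;

/// Comparison operators for the toy predicates (hypothetical, not DataFusion's).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Op {
    Eq,
    Gt,
    Lt,
}

/// One side of a comparison: a column name or an integer literal.
#[derive(Debug, Clone, PartialEq)]
enum Operand {
    Col(String),
    Lit(i64),
}

/// A binary predicate such as `t1.b > t2.b`.
#[derive(Debug, Clone, PartialEq)]
struct Pred {
    left: Operand,
    op: Op,
    right: Operand,
}

/// Returns only the newly inferred predicates; the inputs are assumed to be ANDed together.
fn infer_from_equalities(preds: &[Pred]) -> Vec<Pred> {
    // Step 1: collect `column = literal` bindings, in either operand order.
    let mut bindings: HashMap<&str, i64> = HashMap::new();
    for p in preds {
        match (&p.left, p.op, &p.right) {
            (Operand::Col(c), Op::Eq, Operand::Lit(v))
            | (Operand::Lit(v), Op::Eq, Operand::Col(c)) => {
                bindings.insert(c.as_str(), *v);
            }
            _ => {}
        }
    }

    // Step 2: replace bound columns with their literal values.
    let substitute = |o: &Operand| match o {
        Operand::Col(c) => match bindings.get(c.as_str()) {
            Some(v) => Operand::Lit(*v),
            None => o.clone(),
        },
        Operand::Lit(_) => o.clone(),
    };

    let mut inferred = Vec::new();
    for p in preds {
        let new = Pred {
            left: substitute(&p.left),
            op: p.op,
            right: substitute(&p.right),
        };
        // Keep a rewrite only if it changed something and still mentions a column.
        let has_col = matches!(new.left, Operand::Col(_)) || matches!(new.right, Operand::Col(_));
        if new != *p && has_col {
            inferred.push(new);
        }
    }
    inferred
}

fn main() {
    let col = |s: &str| Operand::Col(s.to_string());
    let preds = vec![
        Pred { left: col("t2.b"), op: Op::Eq, right: Operand::Lit(3) },
        Pred { left: col("t1.b"), op: Op::Gt, right: col("t2.b") },
    ];
    let inferred = infer_from_equalities(&preds);
    // From `t2.b = 3` and `t1.b > t2.b` the sketch derives `t1.b > 3`.
    assert_eq!(
        inferred,
        vec![Pred { left: col("t1.b"), op: Op::Gt, right: Operand::Lit(3) }]
    );
    println!("{:?}", inferred);
}
```

The inferred predicate mentions only `t1` columns, which is what makes it eligible for pushdown to the `t1` side of a join.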

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Apr 30, 2025
logical_plan
01)Inner Join: t1.a = t2.a
02)--Projection: t1.a, t1.b
03)----Filter: __common_expr_4 >= Int64(3) AND __common_expr_4 <= Int64(5)
Member Author

This is the inferred predicate; it can be pushed down to the t1 scan.

/// Infers new predicates by substituting equalities.
/// For example, with predicates `t2.b = 3` and `t1.b > t2.b`,
/// we can infer `t1.b > 3`.
fn infer_predicates_from_equalities(predicates: Vec<Expr>) -> Result<Vec<Expr>> {
Member Author

In the future, we can move the code into a dedicated optimizer rule, such as InferPredicates.

Contributor

I think this might be a special case of the range analysis code in

https://docs.rs/datafusion/latest/datafusion/physical_expr/intervals/cp_solver/index.html

In other words, instead of this special case maybe we could use the cp_solver to create a more general framework for introducing inferred predicates 🤔

Now that we have predicate pushdown for ExecutionPlans, maybe it is more realistic to do this.
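
To make the range-analysis framing concrete, here is a toy sketch of interval propagation over integers. It does not use the cp_solver API; the `Interval` type and the `propagate_gt` function are hypothetical and exist only for illustration. Under the constraint `t1.b > t2.b`, knowing that `t2.b` lies in `[3, 3]` tightens `t1.b` to `[4, +inf)`, which is the integer form of `t1.b > 3`.

```rust
/// Inclusive integer interval; `None` means unbounded on that side.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lower: Option<i64>,
    upper: Option<i64>,
}

impl Interval {
    const UNBOUNDED: Interval = Interval { lower: None, upper: None };

    /// The interval containing exactly one value.
    fn point(v: i64) -> Self {
        Interval { lower: Some(v), upper: Some(v) }
    }
}

/// Tighter of two lower bounds (`None` is negative infinity).
fn max_lower(a: Option<i64>, b: Option<i64>) -> Option<i64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x.max(y)),
        (x, None) => x,
        (None, y) => y,
    }
}

/// Tighter of two upper bounds (`None` is positive infinity).
fn min_upper(a: Option<i64>, b: Option<i64>) -> Option<i64> {
    match (a, b) {
        (Some(x), Some(y)) => Some(x.min(y)),
        (x, None) => x,
        (None, y) => y,
    }
}

/// Tighten `x` and `y` under the constraint `x > y` on integers.
fn propagate_gt(x: Interval, y: Interval) -> (Interval, Interval) {
    // x must exceed the smallest possible y.
    let new_x = Interval {
        lower: max_lower(x.lower, y.lower.map(|l| l.saturating_add(1))),
        upper: x.upper,
    };
    // y must stay below the largest possible x.
    let new_y = Interval {
        lower: y.lower,
        upper: min_upper(y.upper, x.upper.map(|u| u.saturating_sub(1))),
    };
    (new_x, new_y)
}

fn main() {
    // `t2.b = 3` gives a point interval; `t1.b` starts unconstrained.
    let t2_b = Interval::point(3);
    let t1_b = Interval::UNBOUNDED;

    let (t1_b, t2_b) = propagate_gt(t1_b, t2_b);
    assert_eq!(t1_b, Interval { lower: Some(4), upper: None });
    assert_eq!(t2_b, Interval::point(3));
    println!("t1.b in {:?}, t2.b in {:?}", t1_b, t2_b);
}
```

A tightened interval like this could then be turned back into a filter on the `t1` side, which is roughly the more general framework suggested here.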

Member Author

I'll check the cp_solver (I hadn't noticed that part of the code before).

Contributor

Great suggestion. I can help if you need direction or run into anything confusing.

@xudong963 xudong963 requested a review from alamb April 30, 2025 15:37
@xudong963 xudong963 assigned jayzhan211 and unassigned jayzhan211 Apr 30, 2025
@xudong963 xudong963 requested a review from jayzhan211 April 30, 2025 15:37
@@ -131,13 +131,25 @@ pub fn normalize_sorts(
}

/// Recursively replace all [`Column`] expressions in a given expression tree with
/// `Column` expressions provided by the hash map argument.
pub fn replace_col(expr: Expr, replace_map: &HashMap<&Column, &Column>) -> Result<Expr> {
Member Author

I didn't want to write a near-duplicate method for this PR, so I made the method generic.

pub fn replace_col_with_expr(
    expr: Expr,
    replace_map: &HashMap<Column, &Expr>,
) -> Result<Expr> {
    expr.transform(|expr| {
        Ok({
            if let Expr::Column(c) = &expr {
                match replace_map.get(c) {
                    Some(new_expr) => Transformed::yes((**new_expr).to_owned()),
                    None => Transformed::no(expr),
                }
            } else {
                Transformed::no(expr)
            }
        })
    })
    .data()
}
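
As a rough illustration of why the generic form subsumes the column-to-column one, the sketch below uses a toy expression tree rather than DataFusion's `Expr`. The names `ToyExpr`, `substitute_columns`, and `rename_columns` are hypothetical, and whether the PR itself layers `replace_col` on top of `replace_col_with_expr` in this way is an assumption.

```rust
use std::collections::HashMap;

/// Toy expression tree (hypothetical; stands in for a real plan expression type).
#[derive(Debug, Clone, PartialEq)]
enum ToyExpr {
    Column(String),
    Literal(i64),
    Gt(Box<ToyExpr>, Box<ToyExpr>),
}

/// Generic form: replace each mapped column with an arbitrary expression.
fn substitute_columns(expr: ToyExpr, map: &HashMap<String, ToyExpr>) -> ToyExpr {
    match expr {
        ToyExpr::Column(name) => match map.get(&name) {
            Some(replacement) => replacement.clone(),
            None => ToyExpr::Column(name),
        },
        ToyExpr::Literal(v) => ToyExpr::Literal(v),
        ToyExpr::Gt(l, r) => ToyExpr::Gt(
            Box::new(substitute_columns(*l, map)),
            Box::new(substitute_columns(*r, map)),
        ),
    }
}

/// Column-to-column renaming expressed as a special case of the generic form.
fn rename_columns(expr: ToyExpr, renames: &HashMap<String, String>) -> ToyExpr {
    let map: HashMap<String, ToyExpr> = renames
        .iter()
        .map(|(from, to)| (from.clone(), ToyExpr::Column(to.clone())))
        .collect();
    substitute_columns(expr, &map)
}

fn main() {
    let col = |s: &str| ToyExpr::Column(s.to_string());
    let expr = ToyExpr::Gt(Box::new(col("t1.b")), Box::new(col("t2.b")));

    // Column -> expression: substitute the literal 3 for `t2.b`.
    let mut by_expr = HashMap::new();
    by_expr.insert("t2.b".to_string(), ToyExpr::Literal(3));
    let inferred = substitute_columns(expr.clone(), &by_expr);
    assert_eq!(
        inferred,
        ToyExpr::Gt(Box::new(col("t1.b")), Box::new(ToyExpr::Literal(3)))
    );

    // Column -> column: the same machinery handles plain renames.
    let mut renames = HashMap::new();
    renames.insert("t1.b".to_string(), "x".to_string());
    let renamed = rename_columns(expr, &renames);
    assert_eq!(renamed, ToyExpr::Gt(Box::new(col("x")), Box::new(col("t2.b"))));
}
```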

@Omega359
Contributor

Omega359 commented Apr 30, 2025

Perhaps it would be good to run clickbench, or an equivalent benchmark if clickbench wouldn't trigger this optimization, to showcase the difference?

@alamb
Contributor

alamb commented May 5, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1013-gcp #13~24.04.1-Ubuntu SMP Wed Apr 2 16:34:16 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing infer_filter (d0aac59) to af99b54 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented May 5, 2025

🤖: Benchmark completed

Comparing HEAD and infer_filter
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ infer_filter ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │  1918.57ms │    1907.49ms │     no change │
│ QQuery 1     │   676.14ms │     741.17ms │  1.10x slower │
│ QQuery 2     │  1387.03ms │    1520.49ms │  1.10x slower │
│ QQuery 3     │   720.59ms │     712.25ms │     no change │
│ QQuery 4     │  1490.52ms │    1472.76ms │     no change │
│ QQuery 5     │ 15473.75ms │   15353.95ms │     no change │
│ QQuery 6     │  2069.18ms │    2073.35ms │     no change │
│ QQuery 7     │  2875.14ms │    2642.26ms │ +1.09x faster │
└──────────────┴────────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)           │ 26610.92ms │
│ Total Time (infer_filter)   │ 26423.71ms │
│ Average Time (HEAD)         │  3326.36ms │
│ Average Time (infer_filter) │  3302.96ms │
│ Queries Faster              │          1 │
│ Queries Slower              │          2 │
│ Queries with No Change      │          5 │
└─────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃       HEAD ┃ infer_filter ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.35ms │       2.33ms │     no change │
│ QQuery 1     │    39.48ms │      39.13ms │     no change │
│ QQuery 2     │    92.77ms │      90.60ms │     no change │
│ QQuery 3     │    99.39ms │     100.59ms │     no change │
│ QQuery 4     │   752.62ms │     770.88ms │     no change │
│ QQuery 5     │   850.04ms │     847.61ms │     no change │
│ QQuery 6     │     2.22ms │       2.25ms │     no change │
│ QQuery 7     │    43.03ms │      45.67ms │  1.06x slower │
│ QQuery 8     │   906.82ms │     907.40ms │     no change │
│ QQuery 9     │  1188.74ms │    1218.91ms │     no change │
│ QQuery 10    │   262.08ms │     276.10ms │  1.05x slower │
│ QQuery 11    │   298.84ms │     312.09ms │     no change │
│ QQuery 12    │   915.60ms │     932.42ms │     no change │
│ QQuery 13    │  1357.23ms │    1374.77ms │     no change │
│ QQuery 14    │   843.58ms │     876.74ms │     no change │
│ QQuery 15    │  1035.11ms │    1047.61ms │     no change │
│ QQuery 16    │  1752.19ms │    1755.11ms │     no change │
│ QQuery 17    │  1623.80ms │    1626.60ms │     no change │
│ QQuery 18    │  3116.11ms │    3116.02ms │     no change │
│ QQuery 19    │    86.93ms │      84.66ms │     no change │
│ QQuery 20    │  1146.58ms │    1148.49ms │     no change │
│ QQuery 21    │  1351.59ms │    1322.77ms │     no change │
│ QQuery 22    │  2225.44ms │    2207.98ms │     no change │
│ QQuery 23    │  8476.41ms │    8512.21ms │     no change │
│ QQuery 24    │   466.72ms │     482.64ms │     no change │
│ QQuery 25    │   392.36ms │     393.81ms │     no change │
│ QQuery 26    │   536.71ms │     546.16ms │     no change │
│ QQuery 27    │  1681.25ms │    1721.62ms │     no change │
│ QQuery 28    │ 12921.17ms │   12629.70ms │     no change │
│ QQuery 29    │   537.41ms │     531.43ms │     no change │
│ QQuery 30    │   823.54ms │     825.28ms │     no change │
│ QQuery 31    │   862.88ms │     878.91ms │     no change │
│ QQuery 32    │  2661.64ms │    2696.28ms │     no change │
│ QQuery 33    │  3403.66ms │    3404.14ms │     no change │
│ QQuery 34    │  3402.45ms │    3432.14ms │     no change │
│ QQuery 35    │  1294.58ms │    1293.32ms │     no change │
│ QQuery 36    │   124.72ms │     126.12ms │     no change │
│ QQuery 37    │    62.15ms │      57.33ms │ +1.08x faster │
│ QQuery 38    │   127.61ms │     125.38ms │     no change │
│ QQuery 39    │   201.93ms │     208.89ms │     no change │
│ QQuery 40    │    49.64ms │      49.64ms │     no change │
│ QQuery 41    │    46.39ms │      47.07ms │     no change │
│ QQuery 42    │    40.03ms │      39.34ms │     no change │
└──────────────┴────────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary           ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)           │ 58105.81ms │
│ Total Time (infer_filter)   │ 58108.12ms │
│ Average Time (HEAD)         │  1351.30ms │
│ Average Time (infer_filter) │  1351.35ms │
│ Queries Faster              │          1 │
│ Queries Slower              │          2 │
│ Queries with No Change      │         40 │
└─────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃     HEAD ┃ infer_filter ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 122.06ms │     121.81ms │     no change │
│ QQuery 2     │  23.31ms │      23.84ms │     no change │
│ QQuery 3     │  34.95ms │      34.94ms │     no change │
│ QQuery 4     │  20.60ms │      21.60ms │     no change │
│ QQuery 5     │  54.23ms │      55.78ms │     no change │
│ QQuery 6     │  11.93ms │      12.03ms │     no change │
│ QQuery 7     │ 103.93ms │     104.94ms │     no change │
│ QQuery 8     │  25.87ms │      27.01ms │     no change │
│ QQuery 9     │  62.76ms │      62.03ms │     no change │
│ QQuery 10    │  57.75ms │      58.35ms │     no change │
│ QQuery 11    │  13.27ms │      12.92ms │     no change │
│ QQuery 12    │  44.48ms │      46.09ms │     no change │
│ QQuery 13    │  28.41ms │      29.99ms │  1.06x slower │
│ QQuery 14    │   9.98ms │       9.98ms │     no change │
│ QQuery 15    │  24.83ms │      25.85ms │     no change │
│ QQuery 16    │  22.97ms │      25.65ms │  1.12x slower │
│ QQuery 17    │  97.96ms │      96.66ms │     no change │
│ QQuery 18    │ 243.04ms │     249.09ms │     no change │
│ QQuery 19    │  28.85ms │      29.68ms │     no change │
│ QQuery 20    │  40.67ms │      39.36ms │     no change │
│ QQuery 21    │ 171.04ms │     171.72ms │     no change │
│ QQuery 22    │  18.48ms │      17.33ms │ +1.07x faster │
└──────────────┴──────────┴──────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary           ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)           │ 1261.37ms │
│ Total Time (infer_filter)   │ 1276.65ms │
│ Average Time (HEAD)         │   57.33ms │
│ Average Time (infer_filter) │   58.03ms │
│ Queries Faster              │         1 │
│ Queries Slower              │         2 │
│ Queries with No Change      │        19 │
└─────────────────────────────┴───────────┘

@alamb alamb marked this pull request as draft May 7, 2025 20:45
@alamb
Contributor

alamb commented May 7, 2025

Marking as draft, as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look.

Labels
logical-expr Logical plan and expressions optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
5 participants