Fix PruningPredicate interaction with DynamicFilterPhysicalExpr that references partition columns#19129
Conversation
|
Todo: add test that is always true. Add test for nested / complex literal trees. |
| enable_page_index: false, | ||
| enable_bloom_filter: false, | ||
| enable_row_group_stats_pruning: true, | ||
| enable_row_group_stats_pruning: false, // note that this is false! |
There was a problem hiding this comment.
Otherwise the test failed because the predicate would successfully prune based on stats
…19130) This improves handling of constant expressions during pruning by trying to evaluate them in the simplifier and the pruning machinery. This is somewhat redundant with #19129 in the simple case of our Parquet implementation but since there may be edge cases where one is hit and not the other, or where users are using them independently I thought it best to implement both approaches.
eebccef to
b1c49b4
Compare
b1c49b4 to
7d3efc8
Compare
| /// Count of distinct column references in an expression. | ||
| /// This is the same as [`collect_columns`] but optimized to stop counting | ||
| /// once more than one distinct column is found. | ||
| /// | ||
| /// For example, in expression `col1 + col2`, the count is `Many`. | ||
| /// In expression `col1 + 5`, the count is `One`. | ||
| /// In expression `5 + 10`, the count is `Zero`. | ||
| #[derive(Debug, PartialEq, Eq)] | ||
| enum ColumnReferenceCount { |
There was a problem hiding this comment.
This replaces collect_columns because:
- We only ever want to know if there's one or more, this short circuits / avoids extra work if we're going to bail anyway.
- Makes the match statements clearer instead of matching on
.len()integers. - Avoids
columns.iter().first().unwrap()later on (even though this does still contain an unwrap internally)
|
|
||
| // Test that always-true literal predicates don't prune any containers | ||
| #[test] | ||
| fn row_group_predicate_literal_true() { |
There was a problem hiding this comment.
Can we please also add a test for literal (boolean) null?
There was a problem hiding this comment.
Added row_group_predicate_literal_null
| prune_with_expr(lit(true).or(lit(false)), &schema, &statistics, &[true]); | ||
|
|
||
| // Complex nested: (1 < 2) AND (3 > 1) = true AND true = true | ||
| prune_with_expr( |
There was a problem hiding this comment.
Can you also please add an error test for pruning with a non boolean (e.g. lit(1i32)) -- and just verify that it errors resonably (rather than gives the wrong answer)
|
|
||
| // Handle literal-to-literal comparisons (no columns on either side) | ||
| // e.g., lit(1) = lit(2) should evaluate to false and prune all containers | ||
| if left_columns.is_empty() && right_columns.is_empty() { |
There was a problem hiding this comment.
This seems redundant with the constant folding introduced in #19130
Why do we need both? Maybe we just need to constant fold the expression after applying the physical expr adapter 🤔
It does overlap with that work. My reasoning for doing it in two places was that these are two disconnected APIs (i.e. we don't require you to run the simplifier before calling |
|
@alamb I've removed the handling of literals and instead added documentation and integration tests. So this PR is now tests + refactoring to short circuit |
a60b943 to
49ebece
Compare
|
Okay I did find the one case that this covers: This will generate a dynamic filter that references |
I've copied that change over to here, it seems more appropriate and fits in with the original goal of this PR |
| /// | ||
| /// Returns a `[`Transformed`] indicating whether a snapshot was taken, | ||
| /// along with the resulting `PhysicalExpr`. | ||
| pub fn snapshot_physical_expr_opt( |
There was a problem hiding this comment.
The idea here is that instead of doing 1 traversal to determine if it's a dynamic expression and another to snapshot we can do a single traversal. This also handles the case where an arbitrary PhysicalExpr implements snapshotting that is not a dynamic filter.
| .with("c1", ContainerStats::new_i32(vec![Some(0)], vec![Some(10)])); | ||
| let expected_ret = &[true]; | ||
| prune_with_expr(lit(1), &schema, &statistics, expected_ret); | ||
| } |
There was a problem hiding this comment.
@alamb this is the other test you asked for
f6b92f8 to
ecad024
Compare
| // which does not handle dynamic exprs in general | ||
| let expr = snapshot_physical_expr(expr)?; | ||
| /// | ||
| /// Note that `PruningPredicate` does not attempt to normalize or simplify |
There was a problem hiding this comment.
it seems to me like PruningPredicate does now actually call simplify 🤔 (if it is a snapshot )
| // children after snapshotting and previously `replace_columns_with_literals` may have been called with partition values | ||
| // the expression we have now is `8 < 5 and col < 10`. | ||
| // Thus we need as simplifier pass to get `false and col < 10` => `false` here. | ||
| let simplifier = PhysicalExprSimplifier::new(&schema); |
There was a problem hiding this comment.
Since this code is specific to dynamic expressions, maybe the call to simplify would make more sense in the snapshot_physical_expr_opt method itself?
There was a problem hiding this comment.
Hmm interesting. I think maybe best to keep things as is. E.g. if you're going to evaluate the expression against data (as opposed to doing the kind of weird stuff PruningPredicate does) then maybe you don't want to pay the simplify cost?
| expr.apply(|expr| { | ||
| if let Some(column) = expr.as_any().downcast_ref::<phys_expr::Column>() { | ||
| seen.insert(column.clone()); | ||
| if seen.len() > 1 { |
There was a problem hiding this comment.
I am surprised clippy didn't complain about this not using is_empty 🤔
There was a problem hiding this comment.
I think because len() > 1 != len >= 1
PhysicalExprSimplifierbeforePruningPredicatewe are able to prune.