#17801 Improve nullability reporting of case expressions #17813
alamb
left a comment
Thank you for this PR @pepijnve --
I am not quite sure about this implementation (I am hoping #17628 might solve the problem too with more sophisticated case folding)
However, I verified by running the benchmarks that it does solve the problem, so from that perspective I think we should proceed
My only real concern is that the newly added tests cover only the new code, and not the "end to end" behavior you tracked down (namely that the case pattern with coalesce changes the nullability).
Would it be possible to add some of the cases as expr simplification tests too? Somewhere like here?
| #[test] |
datafusion/expr/src/expr_schema.rs
Outdated
| when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo")) | ||
| .otherwise(lit(0))?, |
Minor: You can probably make this more concise using the eq method, something like this:
| when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo")) | |
| .otherwise(lit(0))?, | |
| when(col("foo").eq(lit(5)), col("foo")).otherwise(lit(0))?, |
likewise there is Expr::and for ands that could be used as well below
However, the current setup of using and as a prefix is pretty clear too, so maybe what you have here is actually more readable.
Ah I missed that. I was looking for prefix versions, and hadn't realised infix ones existed too.
I ended up sticking with prefix notation for the boolean combinators and infix for the rest. Using infix for the boolean made it hard to read. I've also added the SQL equivalent as a comment.
datafusion/expr/src/expr_schema.rs
Outdated
| assert!(expr.nullable(&get_schema(false)).unwrap()); | ||
| } | ||
|
|
||
| fn check_nullability( |
I found this a little confusing at first, because it makes an explicit assumption that exprs will never introduce nulls (in order for !expr.nullable(&get_schema(false))? to be true). So, for example, it wouldn't do the right thing with the NULLIF function, e.g. NULLIF(foo, 25)
Maybe some comments would help
| fn check_nullability( | |
| /// Verifies that `expr` has `nullable` nullability when the 'foo' column is | |
| /// null. | |
| /// Also assumes and verifies that `expr` is NOT nullable when 'foo' is NOT null | |
| fn check_nullability( |
I've reworked the logical plan test cases already to (hopefully) make it more obvious what's going on. I hadn't given this function much thought since it was only a test thing.
datafusion/expr/src/expr_schema.rs
Outdated
| check_nullability( | ||
| when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo")) | ||
| .otherwise(lit(0))?, | ||
| true, |
technically this could also be reported as false, given that if foo is null, then the expr resolves to 0 (non null)
> create table t(foo int) as values (0), (NULL), (5);
0 row(s) fetched.
Elapsed 0.001 seconds.
> select foo, CASE WHEN foo=5 THEN foo ELSE 0 END from t;
+------+---------------------------------------------------------+
| foo | CASE WHEN t.foo = Int64(5) THEN t.foo ELSE Int64(0) END |
+------+---------------------------------------------------------+
| 0 | 0 |
| NULL | 0 |
| 5 | 5 |
+------+---------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.002 seconds.
However, maybe we can improve that in a follow on PR
Agreed, the const evaluation is far from complete. I tried to do something good enough for the coalesce simplification initially.
I was wondering the whole time if there isn't some existing null analysis logic somewhere in the codebase we could reuse. The best I could come up with is rewriting the full expression by replacing the then expression with literal NULL and then attempting const evaluation. But that got me worrying about planning overhead again.
datafusion/expr/src/expr_schema.rs
Outdated
| when( | ||
| or( | ||
| is_not_null(col("foo")), | ||
| binary_expr(col("foo"), Operator::Eq, lit(5)), |
as above, I don't think this expression can ever be true so this overall expression is still non nullable
datafusion/expr/src/expr_schema.rs
Outdated
| col("foo"), | ||
| ) | ||
| .otherwise(lit(0))?, | ||
| true, |
Here too -- this expression is not nullable
> select foo, CASE WHEN foo=5 OR foo IS NOT NULL THEN foo ELSE 0 END from t;
+------+------------------------------------------------------------------------------+
| foo | CASE WHEN t.foo = Int64(5) OR t.foo IS NOT NULL THEN t.foo ELSE Int64(0) END |
+------+------------------------------------------------------------------------------+
| 0 | 0 |
| NULL | 0 |
| 5 | 5 |
+------+------------------------------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.002 seconds.
FWIW, if you comment out the filter step (i.e. revert to the pre-patch version) all of these cases are reported as being nullable. The scope of this PR is to get at least some cases that are definitely not nullable reported as such, not ensure all cases are reported correctly.
datafusion/expr/src/expr_schema.rs
Outdated
| .otherwise(lit(0))?, | ||
| true, | ||
| get_schema, | ||
| )?; |
Can you also please add a check with is_null in the OR clause (which should be nullable)?
Something like the equivalent to
> select foo, CASE WHEN foo=5 OR foo IS NULL THEN foo ELSE 0 END from t;
+------+--------------------------------------------------------------------------+
| foo | CASE WHEN t.foo = Int64(5) OR t.foo IS NULL THEN t.foo ELSE Int64(0) END |
+------+--------------------------------------------------------------------------+
| 0 | 0 |
| NULL | NULL |
| 5 | 5 |
+------+--------------------------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.000 seconds.
Like
check_nullability(
when(
or(
binary_expr(col("foo"), Operator::Eq, lit(5)),
is_null(col("foo")),
),
col("foo"),
)
.otherwise(lit(0))?,
true,
get_schema,
)?;
I've added this test case
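The three CASE patterns discussed in this thread can be reproduced with a small self-contained model. This is only a sketch, not the PR's implementation: `Pred`, `eval`, `case_value` and `case_is_nullable` are hypothetical names, and the model covers just a single nullable integer column `foo` with the shape `CASE WHEN <pred> THEN foo ELSE 0 END`. The idea it illustrates is the one mentioned above: substitute NULL for the column, evaluate the WHEN predicate under SQL three-valued logic, and treat the THEN branch as able to yield NULL only if the predicate is definitely true.

```rust
/// Toy predicate over one nullable integer column `foo`.
/// Hypothetical stand-in for illustration; not DataFusion's `Expr`.
#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Pred {
    /// `foo = <literal>`
    FooEq(i64),
    /// `foo IS NULL`
    FooIsNull,
    /// `foo IS NOT NULL`
    FooIsNotNull,
    /// `a OR b`
    Or(Box<Pred>, Box<Pred>),
    /// `a AND b`
    And(Box<Pred>, Box<Pred>),
}

/// Evaluate `pred` for a concrete `foo` using SQL three-valued logic.
/// `None` stands for SQL NULL (unknown).
fn eval(pred: &Pred, foo: Option<i64>) -> Option<bool> {
    match pred {
        Pred::FooEq(v) => foo.map(|f| f == *v),
        Pred::FooIsNull => Some(foo.is_none()),
        Pred::FooIsNotNull => Some(foo.is_some()),
        Pred::Or(a, b) => match (eval(a, foo), eval(b, foo)) {
            (Some(true), _) | (_, Some(true)) => Some(true),
            (Some(false), Some(false)) => Some(false),
            _ => None,
        },
        Pred::And(a, b) => match (eval(a, foo), eval(b, foo)) {
            (Some(false), _) | (_, Some(false)) => Some(false),
            (Some(true), Some(true)) => Some(true),
            _ => None,
        },
    }
}

/// Value of `CASE WHEN <pred> THEN foo ELSE 0 END` for one row.
/// A WHEN branch is taken only when its predicate is true (not NULL).
fn case_value(pred: &Pred, foo: Option<i64>) -> Option<i64> {
    if eval(pred, foo) == Some(true) {
        foo
    } else {
        Some(0)
    }
}

/// The CASE above can only produce NULL through the THEN branch, and only
/// when `foo` is NULL. So check whether the WHEN is true with `foo = NULL`.
fn case_is_nullable(pred: &Pred) -> bool {
    eval(pred, None) == Some(true)
}
```

With this model, `foo = 5` and `foo IS NOT NULL OR foo = 5` both come out as not nullable, while `foo = 5 OR foo IS NULL` comes out as nullable, matching the query results shown above.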
I warned you it wasn't very elegant. 😄 I don't think #17628 covers the same thing though. What we're trying to do here is get
I'm not sure what kind of test you have in mind. The end to end case is (admittedly very indirectly) covered by TPC-DS query 75 and the removal of the double optimisation. If you revert the production code change in this PR, but keep the test change you'll see that it fails. For the simplifier itself, I was wondering if there shouldn't be some internal assertions that verifies that the result of calling |
@alamb thinking about this a bit more. I'm going to struggle expressing myself sufficiently clearly here, but I'll try to explain the idea behind what I'm doing. Maybe that can help us figure out a better way to express the idea.

What I'm trying to do is improve the accuracy of the predicate

In particular there's one interesting case (pun not intended) which results from the

What I attempted to do in this PR is to look at the more general form

I tried to implement this in a cheap, but imprecise way. My rationale was that even though it's not perfect, it's an improvement in accuracy over the current code.
I've massaged the logical plan version of the code a bit further already to hopefully clarify what it's doing. I then ran the test cases with logging output rather than assertions before and after the extra filtering to illustrate what's being changed. After the change all tests pass. Before the patch it reports the following |
|
@alamb I've taken the logical expression portion of the PR another step further which ensures correct answers for the expressions you mentioned earlier. I can complete the physical expression portion as well if you like. Unless you tell me this path is a dead end. |
Thank you -- I will try and get to this one asap. Somehow every time I think I am getting the queue of reviews under control there are like 50 new notifications! It is a good problem to have.
No pressure from my side. I just write up my notes and move on to the next thing. Async delayed response is fine. |
|
I experimented a bit with the rewrite + const eval approach on the physical expression side of things. While attractive and simple to implement, the downside is that it's going to be very hard to ensure the logical and physical side agree. Logical needs to work without |
Thank you -- this is on my list of things to review shortly
alamb
left a comment
Thank you @pepijnve -- this is looking so close -- I think we should roll back some of the GuaranteeRewriter changes to avoid API churn. If you could break them out into their own PR I think that would be good and I could review them quickly
| /// ```text | ||
| /// A ∧ B │ F U T | ||
| /// ──────┼────── | ||
| /// F │ F F F |
this truth table seems to be missing the values for A (self)
That's the left column. Let me see if I can figure out a compact way to clarify that.
These tables were based on https://en.wikipedia.org/wiki/Three-valued_logic#Kleene_and_Priest_logics
I've tweaked them a bit further to resemble those as closely as possible.
| /// This method uses the following truth table. | ||
| /// | ||
| /// ```text | ||
| /// A ∨ B │ F U T |
likewise here this table is missing values for A
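For reference, the Kleene AND and OR tables that these doc comments describe can be written out as a small enum. This is a minimal sketch under assumed names (`Tri`, `and`, `or`, `not` are hypothetical and not the type used in the PR); in each match the left operand is `self`, which is the left column "A" of the tables.

```rust
/// Three-valued truth value: False, Unknown (SQL NULL), True.
/// Hypothetical type for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tri {
    False,
    Unknown,
    True,
}

impl Tri {
    /// Kleene conjunction: False dominates, True AND True is True,
    /// everything else is Unknown.
    fn and(self, other: Tri) -> Tri {
        match (self, other) {
            (Tri::False, _) | (_, Tri::False) => Tri::False,
            (Tri::True, Tri::True) => Tri::True,
            _ => Tri::Unknown,
        }
    }

    /// Kleene disjunction: True dominates, False OR False is False,
    /// everything else is Unknown.
    fn or(self, other: Tri) -> Tri {
        match (self, other) {
            (Tri::True, _) | (_, Tri::True) => Tri::True,
            (Tri::False, Tri::False) => Tri::False,
            _ => Tri::Unknown,
        }
    }

    /// Kleene negation: Unknown stays Unknown.
    fn not(self) -> Tri {
        match self {
            Tri::False => Tri::True,
            Tri::Unknown => Tri::Unknown,
            Tri::True => Tri::False,
        }
    }
}
```

Reading a table row then corresponds to fixing `self` and varying `other`, for example `Tri::False.and(x)` is `Tri::False` for every `x`.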
| /// See a full example in [`ExprSimplifier::with_guarantees()`]. | ||
| /// | ||
| /// [`ExprSimplifier::with_guarantees()`]: crate::simplify_expressions::expr_simplifier::ExprSimplifier::with_guarantees | ||
| pub struct GuaranteeRewriter<'a> { |
Unless there is a good reason, I think we should avoid removing this API as it will cause unnecessary churn on downstream crates
If you find rewrite_with_guarantees easier to work with, maybe you could leave GuaranteeRewriter in place but implement rewrite_with_guarantees in terms of that
| use datafusion_common::{DataFusionError, HashMap, Result}; | ||
| use datafusion_expr_common::interval_arithmetic::{Interval, NullableInterval}; | ||
|
|
||
| struct GuaranteeRewriter<'a> { |
I think the GuaranteeRewriter is part of the public API, so making this change would potentially cause breaking downstream changes: https://docs.rs/datafusion/latest/datafusion/optimizer/simplify_expressions/struct.GuaranteeRewriter.html
I think we should leave the GuaranteeRewriter API in place (w/ comments etc) and then make rewrite_with_guarantees a method or something
Perhaps
impl<'a> GuaranteeRewriter<'a> {
    /// Create new guarantees from an iterator
    pub fn new(
        guarantees: impl IntoIterator<Item = &'a (Expr, NullableInterval)>,
    )
    /// Create new guarantees from a map
    pub fn new(
        guarantees: &'a HashMap<&'a Expr, &'a NullableInterval>,
    )
}
🤔
Agreed, it's a breaking change. It's already breaking simply because of the move from one crate to another unless we add a reexport from optimizer.
No objections to restoring public visibility of the struct though. I was just trying to follow the example/style of the order by rewrite sibling on the new module location.
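The compatibility approach suggested above (keep the struct, add the function on top of it) might look roughly like the following. This is a toy sketch, not DataFusion code: expressions are plain strings, `Guarantee` is a hypothetical stand-in for `NullableInterval`, and the `rewrite` method is invented purely so the example runs. Only the shape matters: the free function is a thin wrapper, so existing users of the struct are unaffected.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a guarantee about an expression.
#[derive(Debug, Clone, PartialEq)]
enum Guarantee {
    /// The expression is known to be NULL.
    Null,
    /// The expression is known to be non-NULL.
    NotNull,
}

/// Existing struct kept in place so downstream code keeps compiling.
struct GuaranteeRewriter<'a> {
    guarantees: HashMap<&'a str, &'a Guarantee>,
}

impl<'a> GuaranteeRewriter<'a> {
    /// Build the rewriter from any iterator of (expression, guarantee) pairs.
    fn new(guarantees: impl IntoIterator<Item = &'a (String, Guarantee)>) -> Self {
        Self {
            guarantees: guarantees
                .into_iter()
                .map(|(expr, g)| (expr.as_str(), g))
                .collect(),
        }
    }

    /// Toy rewrite: replace an expression guaranteed to be NULL with the
    /// literal `NULL`; leave everything else untouched.
    fn rewrite(&self, expr: &str) -> String {
        match self.guarantees.get(expr) {
            Some(Guarantee::Null) => "NULL".to_string(),
            _ => expr.to_string(),
        }
    }
}

/// Convenience function implemented in terms of the struct.
fn rewrite_with_guarantees<'a>(
    expr: &str,
    guarantees: impl IntoIterator<Item = &'a (String, Guarantee)>,
) -> String {
    GuaranteeRewriter::new(guarantees).rewrite(expr)
}
```

Callers that prefer the function form can pass a borrowed `Vec` of pairs directly, while callers that already construct `GuaranteeRewriter` need no changes.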
| /// | ||
| /// See a full example in [`ExprSimplifier::with_guarantees()`]. | ||
| /// | ||
| /// [`ExprSimplifier::with_guarantees()`]: crate::simplify_expressions::expr_simplifier::ExprSimplifier::with_guarantees |
You probably removed this doc link b/c the code is now in a different module that doesn't depend on optimizer.
However I think the link still adds value
What we have done in other places where we can't rely on auto links is to use the direct HTML link: https://docs.rs/datafusion/latest/datafusion/optimizer/simplify_expressions/struct.ExprSimplifier.html#method.with_guarantees
This isn't as good, since rustdoc doesn't check that such links stay valid, but I think it is better than removing the link entirely
Yep, that's why I removed it here. Will restore.
I'll make this a separate PR taking the comments you logged so far into account. It'll be easier to track that way. |
## Which issue does this PR close?

- None, break out PR of changes done in #17813

## Rationale for this change

In #17813 `GuaranteeRewriter` is used from the `datafusion_expr` crate. In order to enable this the type needed to be moved from `datafusion_optimizer` to `datafusion_expr`. Additionally, during the development of #17813 some latent bugs were discovered in `GuaranteeRewriter` that have been resolved.

## What changes are included in this PR?

- Move `GuaranteeRewriter` to `datafusion_expr`
- Fix two bugs where rewrites of 'between' expression would fail
  - when one of the bounds was untyped null
  - when the lower bound was greater than the upper bound
- Add logic to replace expressions with literal null based on provided guarantees
- Split implementation into smaller functions for easier readability

## Are these changes tested?

- Existing tests updated
- Tests added for bugfixes

## Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
# Conflicts:
#   datafusion/expr/src/expr_rewriter/guarantees.rs
#   datafusion/expr/src/expr_rewriter/mod.rs
#   datafusion/optimizer/src/simplify_expressions/mod.rs
|
The changes from #18821 have been merged into this PR from |
|
Thank you @pepijnve -- I plan to give this one a final review tomorrow morning and merge it in |
alamb
left a comment
| Expr::Column(c) => input_schema.nullable(c), | ||
| Expr::OuterReferenceColumn(field, _) => Ok(field.is_nullable()), | ||
| Expr::Literal(value, _) => Ok(value.is_null()), | ||
| Expr::Case(case) => { |
While re-reading this I can't help but think the logic is quite non trivial - and someone trying to figure out if an expression is nullable on a deeply nested function might end up calling this function many times
Not for this PR, but I think we should consider how to cache or otherwise avoid re-computing the same nullability (and DataType) expressions over and over again.
I'll write up a follow on ticket
That's absolutely correct. Performance overhead concerns were the main reason I had initially avoided rewriting the expression and instead tried to do the rewrite indirectly. Rather than rewriting using a NullableInterval::Null guarantee, I was checking this using a callback function.
It's probably feasible, but non-trivial to cache this result. What would you use as storage location?
See https://github.com/apache/datafusion/pull/17813/files#r2545958309. That already mitigates the additional calculations a little bit.
It's probably feasible, but non-trivial to cache this result. What would you use as storage location?
Yes, I agree it is non trivial. I wrote up some ideas in
I started looking at the possible options here already a bit. I don't immediately see a simple solution.
|
🤖 |
| Some(Ok(())) | ||
| } | ||
| }) | ||
| .next(); |
The change from collect to next().is_some() does mitigate the performance overhead a little bit. As soon as one nullable branch is found the iteration will stop.
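The effect of that change can be seen in isolation with a plain iterator. This is a simplified sketch with hypothetical names (`any_nullable_collect`, `any_nullable_next`), where each branch is reduced to a `bool` saying whether it is nullable; the real code works on expressions and `Result`s. Both functions return the answer together with the number of branches they had to inspect.

```rust
use std::cell::Cell;

/// Eager variant: `collect` drains the iterator, so every branch is inspected.
fn any_nullable_collect(branches: &[bool]) -> (bool, usize) {
    let inspected = Cell::new(0usize);
    let hits: Vec<()> = branches
        .iter()
        .filter_map(|&nullable| {
            inspected.set(inspected.get() + 1);
            if nullable { Some(()) } else { None }
        })
        .collect();
    (!hits.is_empty(), inspected.get())
}

/// Lazy variant: `next` pulls items only until the first hit, so iteration
/// stops at the first nullable branch.
fn any_nullable_next(branches: &[bool]) -> (bool, usize) {
    let inspected = Cell::new(0usize);
    let found = branches
        .iter()
        .filter_map(|&nullable| {
            inspected.set(inspected.get() + 1);
            if nullable { Some(()) } else { None }
        })
        .next()
        .is_some();
    (found, inspected.get())
}
```

For branches `[false, true, false, false]` both variants report a nullable branch, but the `collect` variant inspects all four while the `next` variant inspects only the first two. When no branch is nullable, both still have to visit everything.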
|
🤖: Benchmark completed Details
|
|
@alamb the variant of the code that did not use I went ahead and made that change at pepijnve#2 |
|
Sounds good -- will run |
|
Since this PR has been outstanding for so long and it fixes a bug and I desperately want to close it down (so we can move on) I am going to merge it as is. @pepijnve would you be willing to make a real PR with the change from |
My browser just let out a sigh of relief. GitHub's UI was struggling with this one.
Certainly |
😆 thank you for sticking with it -- I think the code overall (not just case reporting) is significantly better because of your work
🙏 |
…e#17813)

## Which issue does this PR close?

- Closes apache#17801
- Obviates (contains) and thus Closes apache#17833
- Obviates (contains) and thus Closes apache#18536

## Rationale for this change

apache#17357 introduced a change that replaces `coalesce` function calls with `case` expressions. In the current implementation these two differ in the way they report their nullability. `coalesce` is more precise than `case` and will report itself as not nullable in situations where the equivalent `case` does report being nullable.

The rest of the codebase currently does not expect the nullability property of an expression to change as a side effect of expression simplification. This PR is a first attempt to align the nullability of `coalesce` and `case`.

## What changes are included in this PR?

Tweaks to the `nullable` logic for the logical and physical `case` expression code to report `case` as being not nullable in more situations.

- For logical `case`, a best effort const evaluation of 'when' expressions is done to determine 'then' reachability. The code errs on the conservative side wrt nullability.
- For physical `case`, const evaluation of 'when' expressions using a placeholder record batch is attempted to determine 'then' reachability. Again if const evaluation is not possible, the code errs on the conservative side.
- The optimizer schema check has been relaxed slightly to allow nullability to be removed by optimizer passes without having to disable the schema check entirely
- The panic'ing benchmark has been reenabled

## Are these changes tested?

Additional unit tests have been added to test the new logic.

## Are there any user-facing changes?

No

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>