-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27915][SQL][WIP] Update logical Filter's output nullability based on IsNotNull conditions #24765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…l filter operations.
| // However, if g is NOT NullIntolerant (e.g. if g(null) is non-null) then we cannot | ||
| // conclude anything about x's nullability. | ||
| def getExprIdIfNamed(expr: Expression): Set[ExprId] = expr match { | ||
| case ne: NamedExpression => Set(ne.toAttribute.exprId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should be AttributeReference? I couldn't remember offhand how to get ExprIds from arbitrary expressions, hence this hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use AttributeSet?
|
Test build #106056 has finished for PR 24765 at commit
|
|
jenkins retest this please |
| override def usedInputs: AttributeSet = AttributeSet.empty | ||
|
|
||
| // Split out all the IsNotNulls from condition. | ||
| private val (notNullPreds, otherPreds) = splitConjunctivePredicates(condition).partition { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the old code here to be slightly confusing because it seemed to be using notNullPreds for two different purposes:
- If we see
IsNotNullconjuncts in the filter then evaluate them first / earlier because (a) these expressions are cheap to evaluate and may allow for short-circuiting and skipping more expensive expressions, and (b) evaluating these earlier allows other expressions to omit null checks (for example, if we haveIsNotNull(x)andx * 100 < 10then we already implicitly need to null-checkxas part of the second expression so we might as well do the explicit null check expression first). - Given that tuples have successfully passed through the filter, we can rely on the presence of
IsNotNullchecks to default subsequent expressions' null checks tofalse. For example, let's say we had a.filter().select()which gets compiled into a single whole stage codegen: after tuples have passed through the filter we know that certain fields cannot possibly be null, so we can elide null checks at codegen time by just settingnullable = falsein subsequent code.
There might be some subtleties related in (1) related to non-deterministic expressions, but I think that's accounted for further down at the place where we're actually generating the checks.
In the old code, the (notNullPreds, otherPreds) on this line was being used for both purposes: for (1) I think we could simply collect all IsNotNull expressions, but the existing implementation of (2) relied on the additional nullIntolerant / a.references checks in order to be correct.
In this PR, I've separated these two usages: the "update nullability for downstream operators" now uses the more precise condition implemented in getImpliedNotNullExprIds, while the "optimize short-circuiting" simply checks for IsNotNull and ignores child attributes.
|
Test build #106058 has finished for PR 24765 at commit
|
|
Test build #106057 has finished for PR 24765 at commit
|
|
Test build #106059 has finished for PR 24765 at commit
|
|
This seems to break tests in |
|
/cc @maropu, who submitted a very similar change ~1 year prior in #21148 (I was unaware of that PR when I created this one). Chasing down references from that PR, I discovered #23390 and #23508, both of which are concerned with fixing up nullability in attribute references; maybe one of those holds the trick to fixing the blocker identified in my previous comment. |
|
Yea, thanks for revisiting this, @JoshRosen! I remember we have the two suggestions from @gatorsmile and @cloud-fan in the previous discussion; 1) nullability is just a hint for the optimizer and it might be good to add a new trait for this hint. And, 2) the optimization for |
| val childOutputNullability = child.output.map(a => a.exprId -> a.nullable).toMap | ||
| projectList | ||
| .map(_.toAttribute) | ||
| .map{ a => childOutputNullability.get(a.exprId).map(a.withNullability).getOrElse(a) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to fix this part? It seems UpdateAttributeNullability could handle this case if Filter.output works well?
…ames in PlanTestBase.comparePlans failures ## What changes were proposed in this pull request? This pr proposes to add a prefix '*' to non-nullable attribute names in PlanTestBase.comparePlans failures. In the current master, nullability mismatches might generate the same error message for left/right logical plans like this; ``` // This failure message was extracted from apache#24765 - constraints should be inferred from aliased literals *** FAILED *** == FAIL: Plans do not match === !'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) : +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0] +- Project [2 AS two#0] +- Project [2 AS two#0] +- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145) ``` With this pr, this error message is changed to one below; ``` - constraints should be inferred from aliased literals *** FAILED *** == FAIL: Plans do not match === !'Join Inner, (*two#0 = a#0) 'Join Inner, (*two#0 = *a#0) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0)) : +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0] +- Project [2 AS two#0] +- Project [2 AS two#0] +- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145) ``` ## How was this patch tested? N/A Closes apache#25213 from maropu/MarkForNullability. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
|
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it! |
What changes were proposed in this pull request?
This PR changes the logical
Filteroperator to update its outputs' nullability when filter conditions imply that outputs cannot be null.In addition, I refined similar existing logic in the physical
FilterExec(changing the existing code to be more precise / less conservative in its non-nullability inference) and improved propagation of inferred nullability information inProject.This is useful because of how it composes with other optimizations: Spark has several logical and physical optimizations which leverage non-nullability, so improving nullability inference increases the value of those existing optimizations.
.schemamay change as a result of this optimization: this may have consequences in case nullability information is used by downstream systems (e.g. forCREATE TABLEDDL).Examples
Consider the query
where
t.keyis nullable.Because of the
key IS NOT NULLfilter condition,keywill always be non-null. Prior to this patch, this query's result schema was overly-conservative, continuing to markkeyas nullable. However, if we take advantage of thekey IS NOT NULLcondition we can setnullable = falseforkey.This was a somewhat trivial example, so let's look at some more complicated cases:
Consider
where all columns of
AandBare nullable. Because of the equality join condition, we know thatkeymust be non-null in both tables. In addition, the condition(A.num + B.num) > 0can only hold if bothnumvalues are not null: addition is a null-intolerant operator, meaning that it returnsnullif any of its operands is null.Leveraging this, we should be able to mark both
keyandvalueas non-null in the join result's schema (even though both values are nullable in the underlying input relation).Finally, let's look at an example of a non null-intolerant operator:
coalesce(a, b) IS NOT NULLcould still mean thataorbis null, so inwe can infer that
quxis not null but cannot make any claims aboutfooorbar's nullability.Description of changes
PredicateHelper.getImpliedNotNullExprIds(IsNotNull)helper, which takes anIsNotNullexpression and returns theExprIds of expressions which cannot be null. This handles simple cases likeIsNotNull(columnFromTable), as well as more complex cases involving expression trees (properly accounting for null-(in)tolerance).FilterExec, but I think it was overly conservative: givenIsNotNull(x), it would claim thatxand all of its descendants were not null if and only if every ancestor ofxwasNullIntolerant. However, even ifxis null-tolerant we can still make claims aboutx's non-nullability even if we can't make further claims about its children.logical.Filterto leverage this new function to update output nullability.FilterExecto re-use this logic. This part is a bit tricky because theFilterExeccode looks atIsNotNullexpressions both for optimizing the order of expression evaluation and for refining nullability to elide null checks in downstream operators.logical.Projectso that inferred non-nullability information from child operators is preserved.Background on related historical changes / bugs
While developing this patch, I found the following historical PRs to be useful references (note: many of these original PRs contained correctness bugs which were subsequently fixed in later PRs):
IsNotNullconditions from expressions).FilterExecto update output nullability based onIsNotNullconditions.FilterExecto add special handling forIsNotNullexpression codegen, altering evaluation order to allow for better short-circuiting.NullIntoleranttrait to generalize theIsNotNullextraction logic.FilterExec'sIsNotNull-handling path to useNullIntolerantFilterExec's logic: it did not properly account for null-tolerant operators which were ancestors ofIsNotNullexpressions.IsNotNull.How was this patch tested?
Added new tests for the added
PredicateHelper.getImpliedNotNullExprIds.TODO: add new end-to-end tests reflecting the examples listed above (in order to properly test the integration of this new logic into
logical.Filterandlogical.Project).