[SPARK-27915][SQL][WIP] Update logical Filter's output nullability based on IsNotNull conditions #24765

JoshRosen · 2019-06-01T18:04:56Z

What changes were proposed in this pull request?

This PR changes the logical Filter operator to update its outputs' nullability when filter conditions imply that outputs cannot be null.

In addition, I refined similar existing logic in the physical FilterExec (changing the existing code to be more precise / less conservative in its non-nullability inference) and improved propagation of inferred nullability information in Project.

This is useful because of how it composes with other optimizations: Spark has several logical and physical optimizations which leverage non-nullability, so improving nullability inference increases the value of those existing optimizations.

⚠️ Disclaimers ⚠️

- This is a work-in-progress / skunkworks side project; I'm not working on this full time.
- Nullability has been a major source of bugs in the past: this PR requires careful review.
- I haven't run analyzer / optimizer performance benchmarks, so there's a decent chance that this WIP changeset regresses query planning performance.
- DataFrames / Datasets / queries' .schema may change as a result of this optimization: this may have consequences in case nullability information is used by downstream systems (e.g. for CREATE TABLE DDL).
- The schemas of analyzed and optimized logical plans may now differ in terms of field nullability (because optimization might infer additional constraints which allow us to prove that fields are non-null).

Examples

Consider the query

SELECT key
FROM t
WHERE key IS NOT NULL

where t.key is nullable.

Because of the key IS NOT NULL filter condition, key will always be non-null. Prior to this patch, this query's result schema was overly-conservative, continuing to mark key as nullable. However, if we take advantage of the key IS NOT NULL condition we can set nullable = false for key.

This was a somewhat trivial example, so let's look at some more complicated cases:

Consider

SELECT A.key, A.value
FROM A, B
WHERE
    A.key = B.key AND
    (A.num + B.num) > 0

where all columns of A and B are nullable. Because of the equality join condition, we know that key must be non-null in both tables. In addition, the condition (A.num + B.num) > 0 can only hold if both num values are not null: addition is a null-intolerant operator, meaning that it returns null if any of its operands is null.

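Leveraging this, we should be able to mark both key and num as non-null in the filtered join's output, and therefore key as non-null in the query's result schema (even though these columns are nullable in the underlying input relations). No condition constrains value, so it remains nullable.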
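The null-intolerance argument can be checked by brute force on a toy model. The sketch below is plain Scala rather than Spark code: it models nullable integers as `Option[Int]`, and every name in it (`Row`, `sqlAdd`, `sqlEq`, `sqlGt`, `sqlAnd`) is made up for illustration. It assumes SQL's three-valued logic, where a filter keeps a row only if the condition evaluates to true.

```scala
object NullIntolerantFilterSketch {
  // Toy row for the join of A and B; None stands for SQL NULL.
  final case class Row(aKey: Option[Int], aNum: Option[Int],
                       bKey: Option[Int], bNum: Option[Int])

  // Null-intolerant operators: the result is null if any operand is null.
  def sqlAdd(a: Option[Int], b: Option[Int]): Option[Int] =
    for (x <- a; y <- b) yield x + y
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y
  def sqlGt(a: Option[Int], b: Int): Option[Boolean] = a.map(_ > b)

  // Three-valued AND: false wins, otherwise a null operand makes the result null.
  def sqlAnd(l: Option[Boolean], r: Option[Boolean]): Option[Boolean] =
    (l, r) match {
      case (Some(false), _) | (_, Some(false)) => Some(false)
      case (Some(true), Some(true))            => Some(true)
      case _                                   => None
    }

  // A.key = B.key AND (A.num + B.num) > 0
  def condition(r: Row): Option[Boolean] =
    sqlAnd(sqlEq(r.aKey, r.bKey), sqlGt(sqlAdd(r.aNum, r.bNum), 0))

  // A filter keeps a row only when the condition is exactly true.
  def passes(r: Row): Boolean = condition(r).contains(true)

  // Enumerate a small domain that includes NULL for every column.
  val domain: List[Option[Int]] = List(None, Some(-1), Some(1), Some(2))
  val kept: List[Row] = (for {
    ak <- domain; an <- domain; bk <- domain; bn <- domain
  } yield Row(ak, an, bk, bn)).filter(passes)
}
```

Every row in `kept` has all four columns defined, which is the property that allows them to be marked non-nullable after the filter. A column that the condition never references is unaffected by this argument.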

Finally, let's look at an example of a non null-intolerant operator: coalesce(a, b) IS NOT NULL could still mean that a or b is null, so in

SELECT key, foo, COALESCE(foo, bar) as qux
FROM A
WHERE COALESCE(foo, bar) > 0

we can infer that qux is not null but cannot make any claims about foo or bar's nullability.
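A small plain-Scala model makes the COALESCE case concrete. This is only an illustrative sketch: `Row`, `coalesce`, `passes` and `qux` are hypothetical helpers, and `None` stands for SQL NULL.

```scala
object CoalesceFilterSketch {
  // Toy row of table A; None stands for SQL NULL.
  final case class Row(foo: Option[Int], bar: Option[Int])

  // COALESCE returns its first non-null argument, or null if all are null.
  def coalesce(args: Option[Int]*): Option[Int] =
    args.collectFirst { case Some(v) => v }

  // WHERE COALESCE(foo, bar) > 0: a null comparison result drops the row.
  def passes(r: Row): Boolean = coalesce(r.foo, r.bar).exists(_ > 0)

  // The projected column: COALESCE(foo, bar) AS qux.
  def qux(r: Row): Option[Int] = coalesce(r.foo, r.bar)

  val rows: List[Row] = List(
    Row(None, Some(5)),     // foo is null, row still passes
    Row(Some(3), None),     // bar is null, row still passes
    Row(None, None),        // COALESCE is null, row is filtered out
    Row(Some(-1), Some(9))  // COALESCE is -1, row is filtered out
  )
  val kept: List[Row] = rows.filter(passes)
}
```

Among the kept rows, `qux` is always defined, yet one row has a null `foo` and another has a null `bar`, so neither input column can be marked non-nullable.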

Description of changes

- Introduce a PredicateHelper.getImpliedNotNullExprIds(IsNotNull) helper, which takes an IsNotNull expression and returns the ExprIds of expressions which cannot be null. This handles simple cases like IsNotNull(columnFromTable), as well as more complex cases involving expression trees (properly accounting for null-(in)tolerance).
  - There was similar existing logic in FilterExec, but I think it was overly conservative: given IsNotNull(x), it would claim that x and all of its descendants were not null if and only if x and every descendant of x was NullIntolerant. However, even if x is null-tolerant we can still claim that x itself is non-null, even though we can't make further claims about its children.
- Update logical.Filter to leverage this new function to update output nullability.
- Modify FilterExec to re-use this logic. This part is a bit tricky because the FilterExec code looks at IsNotNull expressions both for optimizing the order of expression evaluation and for refining nullability to elide null checks in downstream operators.
- Modify logical.Project so that inferred non-nullability information from child operators is preserved.
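To illustrate the difference between the all-or-nothing rule and the finer-grained one, here is a minimal sketch on a toy expression tree. It is not the code from this PR and does not use Catalyst: `Attr`, `Add`, `Coalesce`, `conservative` and `impliedNotNull` are hypothetical names, plain `Int` ids stand in for `ExprId`, and only attributes carry ids.

```scala
object ImpliedNotNullSketch {
  // Toy expression tree; Int ids stand in for Catalyst's ExprId.
  sealed trait Expr
  final case class Attr(id: Int) extends Expr              // column reference
  final case class Add(l: Expr, r: Expr) extends Expr      // null-intolerant
  final case class Coalesce(args: Seq[Expr]) extends Expr  // null-tolerant

  // True when the node and all of its descendants are null-intolerant.
  def fullyNullIntolerant(e: Expr): Boolean = e match {
    case Attr(_)     => true
    case Add(l, r)   => fullyNullIntolerant(l) && fullyNullIntolerant(r)
    case Coalesce(_) => false
  }

  // Ids of all attributes referenced anywhere in the tree.
  def references(e: Expr): Set[Int] = e match {
    case Attr(id)       => Set(id)
    case Add(l, r)      => references(l) ++ references(r)
    case Coalesce(args) => args.flatMap(references).toSet
  }

  // All-or-nothing rule (an assumption modeled on the description above):
  // claim every referenced attribute, or nothing at all.
  def conservative(e: Expr): Set[Int] =
    if (fullyNullIntolerant(e)) references(e) else Set.empty

  // Finer-grained rule: given that `e` is known to be non-null, descend
  // through null-intolerant nodes and stop at null-tolerant ones.
  def impliedNotNull(e: Expr): Set[Int] = e match {
    case Attr(id)    => Set(id)
    case Add(l, r)   => impliedNotNull(l) ++ impliedNotNull(r)
    case Coalesce(_) => Set.empty
  }
}
```

For `IsNotNull(a + coalesce(b, c))`, modeled as `Add(Attr(1), Coalesce(Seq(Attr(2), Attr(3))))`, `conservative` yields the empty set while `impliedNotNull` yields `Set(1)`: the addition forces `a` to be non-null even though nothing can be said about `b` or `c`.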

Background on related historical changes / bugs

While developing this patch, I found the following historical PRs to be useful references (note: many of these original PRs contained correctness bugs which were subsequently fixed in later PRs):

- [SPARK-12957][SQL] Initial support for constraint propagation in SparkSQL #10844 first introduced constraint inference and propagation in Spark SQL (including basic extraction of IsNotNull conditions from expressions).
- [SPARK-13751] [SQL] generate better code for Filter #11585 modified FilterExec to update output nullability based on IsNotNull conditions.
- [SPARK-13981][SQL] Defer evaluating variables within Filter operator. #11792 modified FilterExec to add special handling for IsNotNull expression codegen, altering evaluation order to allow for better short-circuiting.
- [SPARK-13995][SQL] Extract correct IsNotNull constraints for Expression #11809 introduced the NullIntolerant trait to generalize the IsNotNull extraction logic.
- [SPARK-13996][SQL] Add more not null attributes for Filter codegen #11810 updated FilterExec's IsNotNull-handling path to use NullIntolerant.
- [SPARK-17981] [SPARK-17957] [SQL] Fix Incorrect Nullability Setting to False in FilterExec #15523 fixed a bug in FilterExec's logic: it did not properly account for null-tolerant operators which were descendants of IsNotNull expressions.
- [SPARK-17897] [SQL] Fixed IsNotNull Constraint Inference Rule #16067 fixed a bug related to negation and IsNotNull.

How was this patch tested?

Added new tests for the added PredicateHelper.getImpliedNotNullExprIds.

TODO: add new end-to-end tests reflecting the examples listed above (in order to properly test the integration of this new logic into logical.Filter and logical.Project).

JoshRosen · 2019-06-01T18:09:04Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala

+    // However, if g is NOT NullIntolerant (e.g. if g(null) is non-null) then we cannot
+    // conclude anything about x's nullability.
+    def getExprIdIfNamed(expr: Expression): Set[ExprId] = expr match {
+      case ne: NamedExpression => Set(ne.toAttribute.exprId)


Maybe this should be AttributeReference? I couldn't remember offhand how to get ExprIds from arbitrary expressions, hence this hack.

Use AttributeSet?

SparkQA · 2019-06-01T19:21:05Z

Test build #106056 has finished for PR 24765 at commit 05e4bcf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-06-01T21:07:45Z

jenkins retest this please

JoshRosen · 2019-06-01T21:53:28Z

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala

+  override def usedInputs: AttributeSet = AttributeSet.empty
+
  // Split out all the IsNotNulls from condition.
  private val (notNullPreds, otherPreds) = splitConjunctivePredicates(condition).partition {


I found the old code here to be slightly confusing because it seemed to be using notNullPreds for two different purposes:

1. If we see IsNotNull conjuncts in the filter then evaluate them first / earlier because (a) these expressions are cheap to evaluate and may allow for short-circuiting and skipping more expensive expressions, and (b) evaluating these earlier allows other expressions to omit null checks (for example, if we have IsNotNull(x) and x * 100 < 10 then we already implicitly need to null-check x as part of the second expression so we might as well do the explicit null check expression first).

2. Given that tuples have successfully passed through the filter, we can rely on the presence of IsNotNull checks to default subsequent expressions' null checks to false. For example, let's say we had a .filter().select() which gets compiled into a single whole stage codegen: after tuples have passed through the filter we know that certain fields cannot possibly be null, so we can elide null checks at codegen time by just setting nullable = false in subsequent code.

There might be some subtleties in (1) related to non-deterministic expressions, but I think that's accounted for further down at the place where we're actually generating the checks.

In the old code, the (notNullPreds, otherPreds) on this line was being used for both purposes: for (1) I think we could simply collect all IsNotNull expressions, but the existing implementation of (2) relied on the additional nullIntolerant / a.references checks in order to be correct.

In this PR, I've separated these two usages: the "update nullability for downstream operators" now uses the more precise condition implemented in getImpliedNotNullExprIds, while the "optimize short-circuiting" simply checks for IsNotNull and ignores child attributes.
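The separation of the two usages can be sketched on a toy model. This is an illustration only, not the FilterExec code: the expression classes and the functions `evaluationOrder`, `implied` and `notNullIds` are hypothetical, and the sketch ignores non-deterministic expressions and codegen entirely.

```scala
object FilterConjunctSketch {
  // Toy expression tree; Int ids stand in for Catalyst's ExprId.
  sealed trait Expr
  final case class Attr(id: Int) extends Expr
  final case class Coalesce(args: Seq[Expr]) extends Expr    // null-tolerant
  final case class IsNotNull(child: Expr) extends Expr
  final case class GreaterThan(l: Expr, lit: Int) extends Expr

  // Usage (1): evaluate every IsNotNull conjunct first, whatever its child is.
  def evaluationOrder(conjuncts: Seq[Expr]): Seq[Expr] = {
    val (notNulls, others) = conjuncts.partition(_.isInstanceOf[IsNotNull])
    notNulls ++ others
  }

  // Attributes proven non-null when `e` is non-null (simplified: only a
  // direct attribute reference yields a claim; anything else yields none).
  def implied(e: Expr): Set[Int] = e match {
    case Attr(id) => Set(id)
    case _        => Set.empty
  }

  // Usage (2): ids whose nullability can be refined for downstream operators.
  def notNullIds(conjuncts: Seq[Expr]): Set[Int] =
    conjuncts.collect { case IsNotNull(c) => implied(c) }.flatten.toSet

  val conjuncts: Seq[Expr] = Seq(
    GreaterThan(Attr(1), 10),
    IsNotNull(Coalesce(Seq(Attr(2), Attr(3)))),
    IsNotNull(Attr(1))
  )
}
```

With these conjuncts, both IsNotNull checks move to the front of the evaluation order, but only attribute 1 ends up in `notNullIds`: the check on the coalesce is still worth evaluating early even though it proves nothing about attributes 2 and 3.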

SparkQA · 2019-06-01T22:02:26Z

Test build #106058 has finished for PR 24765 at commit 1ad4d49.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-01T22:31:50Z

Test build #106057 has finished for PR 24765 at commit 05e4bcf.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-06-01T23:33:04Z

Test build #106059 has finished for PR 24765 at commit a10632f.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-06-09T23:47:05Z

This seems to break tests in InferFiltersFromConstraintsSuite because it causes join conditions to differ in their attribute references' nullability. What I think is happening here is that the reference's nullability is determined during analysis, so when we're analyzing correctAnswer we end up recognizing the reference as non-nullable because the plan is already in its final form (with inferred isNotNull conditions), whereas in the optimized answer those conditions are added after analysis, so the reference is still marked nullable. This change-of-nullability between analysis and optimization ends up breaking the tests. I'm not sure how to fix this.
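A toy model shows why such a mismatch fails a structural plan comparison. Everything here is hypothetical (`AttrRef`, `EqualTo`, the ids and the `ignoringNullability` helper are not Catalyst classes), and the helper is only there to show that the nullable flag is the sole difference; it is not a proposed fix.

```scala
object NullabilityMismatchSketch {
  // Toy attribute reference: case-class equality includes the nullable flag.
  final case class AttrRef(name: String, id: Int, nullable: Boolean)
  final case class EqualTo(left: AttrRef, right: AttrRef)

  // Join condition in the optimized plan: the reference to `a` was resolved
  // before the IsNotNull filter existed, so it is still marked nullable.
  val optimized: EqualTo =
    EqualTo(AttrRef("two", 0, nullable = false), AttrRef("a", 1, nullable = true))

  // Join condition in the expected plan: analysis already saw the IsNotNull
  // filter, so the reference to `a` is marked non-nullable.
  val expected: EqualTo =
    EqualTo(AttrRef("two", 0, nullable = false), AttrRef("a", 1, nullable = false))

  // Erase nullability on both sides to show that nothing else differs.
  def ignoringNullability(e: EqualTo): EqualTo =
    EqualTo(e.left.copy(nullable = true), e.right.copy(nullable = true))
}
```

Here `optimized != expected`, while the two conditions become equal once nullability is erased.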

JoshRosen · 2019-07-15T02:41:13Z

/cc @maropu, who submitted a very similar change ~1 year prior in #21148 (I was unaware of that PR when I created this one).

Chasing down references from that PR, I discovered #23390 and #23508, both of which are concerned with fixing up nullability in attribute references; maybe one of those holds the trick to fixing the blocker identified in my previous comment.

maropu · 2019-07-16T03:50:46Z

Yea, thanks for revisiting this, @JoshRosen! I remember two suggestions from @gatorsmile and @cloud-fan in the previous discussion: 1) nullability is just a hint for the optimizer, and it might be good to add a new trait for this hint; and 2) optimizing Filter.output is not a common use case, and it is more important to fix the same issue in Join.output. So I'm currently not sure that this is the right approach (I agree that this issue should be fixed, though).

maropu · 2019-07-16T03:54:54Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala

+    val childOutputNullability = child.output.map(a => a.exprId -> a.nullable).toMap
+    projectList
+      .map(_.toAttribute)
+      .map{ a => childOutputNullability.get(a.exprId).map(a.withNullability).getOrElse(a) }


Do we need to fix this part? It seems UpdateAttributeNullability could handle this case if Filter.output works well?
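For readers following the diff, the propagation it performs can be sketched in isolation. This is a toy model, not the Catalyst code: `Attr` and `projectOutput` are hypothetical, and an `Int` id stands in for `ExprId`.

```scala
object ProjectNullabilitySketch {
  // Toy attribute; `id` stands in for ExprId.
  final case class Attr(name: String, id: Int, nullable: Boolean) {
    def withNullability(n: Boolean): Attr = copy(nullable = n)
  }

  // Same shape as the diff: look up each projected attribute in the child's
  // output by id and adopt the child's nullability when there is a match.
  def projectOutput(projectList: Seq[Attr], childOutput: Seq[Attr]): Seq[Attr] = {
    val childNullability = childOutput.map(a => a.id -> a.nullable).toMap
    projectList.map { a =>
      childNullability.get(a.id).map(a.withNullability).getOrElse(a)
    }
  }

  // Child (e.g. a Filter) has already proven `key` non-null.
  val childOutput: Seq[Attr] =
    Seq(Attr("key", 1, nullable = false), Attr("value", 2, nullable = true))

  // Project list still carries the stale nullable flag for `key`; `qux` is
  // a new alias with no counterpart in the child's output.
  val projectList: Seq[Attr] =
    Seq(Attr("key", 1, nullable = true), Attr("qux", 3, nullable = true))
}
```

Calling `projectOutput(projectList, childOutput)` marks `key` non-nullable and leaves `qux` untouched, since its id does not appear in the child's output.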

…ames in PlanTestBase.comparePlans failures

## What changes were proposed in this pull request?

This pr proposes to add a prefix '*' to non-nullable attribute names in PlanTestBase.comparePlans failures. In the current master, nullability mismatches might generate the same error message for left/right logical plans like this;

```
// This failure message was extracted from apache#24765
- constraints should be inferred from aliased literals *** FAILED ***
  == FAIL: Plans do not match ===
  !'Join Inner, (two#0 = a#0)                       'Join Inner, (two#0 = a#0)
   :- Filter (isnotnull(a#0) AND (2 <=> a#0))       :- Filter (isnotnull(a#0) AND (2 <=> a#0))
   :  +- LocalRelation <empty>, [a#0, b#0, c#0]     :  +- LocalRelation <empty>, [a#0, b#0, c#0]
   +- Project [2 AS two#0]                          +- Project [2 AS two#0]
      +- LocalRelation <empty>, [a#0, b#0, c#0]        +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145)
```

With this pr, this error message is changed to one below;

```
- constraints should be inferred from aliased literals *** FAILED ***
  == FAIL: Plans do not match ===
  !'Join Inner, (*two#0 = a#0)                      'Join Inner, (*two#0 = *a#0)
   :- Filter (isnotnull(a#0) AND (2 <=> a#0))       :- Filter (isnotnull(a#0) AND (2 <=> a#0))
   :  +- LocalRelation <empty>, [a#0, b#0, c#0]     :  +- LocalRelation <empty>, [a#0, b#0, c#0]
   +- Project [2 AS two#0]                          +- Project [2 AS two#0]
      +- LocalRelation <empty>, [a#0, b#0, c#0]        +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145)
```

## How was this patch tested?

N/A

Closes apache#25213 from maropu/MarkForNullability.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

github-actions · 2019-12-29T00:05:58Z

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

JoshRosen added 6 commits May 31, 2019 20:04

Initial WIP towards sharing similar logic between logical and physical filter operations.

b0182ac

Add explanation / derivation; refine implementation to handle more cases.

b950474

Fix basicPhysicalOperators usage.

33b579c

Add unit test; fix a couple of bugs

acc98f8

Fix unresolved attribute error; rollback comment edit.

fa89706

Propagate Project child nullability

05e4bcf

JoshRosen commented Jun 1, 2019

Relocate code to shrink effective diff

1ad4d49

Fix excess whitespace

a10632f

dongjoon-hyun added the SQL label Jun 14, 2019

maropu reviewed Jul 16, 2019

JoshRosen mentioned this pull request Sep 25, 2019

[SPARK-29213][SQL] Generate extra IsNotNull predicate in FilterExec #25902

Closed

github-actions bot added the Stale label Dec 29, 2019

github-actions bot closed this Dec 30, 2019
