[SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints #11372

sameeragarwal · 2016-02-25T18:47:20Z

What changes were proposed in this pull request?

This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting isNotNull filters in the query plan. These filters are currently inserted beneath existing Filter and Join operators and are inferred based on their data constraints.

Note: While this optimization is applicable to all types of join, it primarily benefits Inner and LeftSemi joins.
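To make the idea concrete, here is a minimal, self-contained sketch of the inference, not the code in this PR. It assumes a toy expression model: `Expr`, `Attr`, `Lit`, `NullFilterSketch`, `notNullAttrs`, and `nullFiltersFor` are hypothetical names invented for illustration and are not Catalyst classes or methods. The sketch relies only on standard SQL semantics: a comparison against NULL evaluates to NULL, and a Filter or inner-join condition keeps a row only when it evaluates to true, so any attribute that appears directly in such a comparison can be filtered for non-null beforehand without changing the result.

```scala
// Toy model only: hypothetical stand-ins, not Catalyst's expression classes.
sealed trait Expr
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class GreaterThan(left: Expr, right: Expr) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class IsNotNull(child: Expr) extends Expr

object NullFilterSketch {
  // Split a conjunction (a AND b AND ...) into its individual conjuncts.
  def conjuncts(e: Expr): Seq[Expr] = e match {
    case And(l, r) => conjuncts(l) ++ conjuncts(r)
    case other     => Seq(other)
  }

  // Attributes that must be non-null for the condition to evaluate to true.
  // Only direct operands of the two comparison operators are considered.
  def notNullAttrs(condition: Expr): Seq[Attr] =
    conjuncts(condition).flatMap {
      case EqualTo(l, r)     => Seq(l, r)
      case GreaterThan(l, r) => Seq(l, r)
      case _                 => Seq.empty[Expr]
    }.collect { case a: Attr => a }.distinct

  // IsNotNull filters that could be placed beneath an operator whose child
  // produces the attributes named in `output`.
  def nullFiltersFor(condition: Expr, output: Set[String]): Seq[Expr] =
    notNullAttrs(condition)
      .filter(a => output.contains(a.name))
      .map(a => IsNotNull(a))
}
```

For a join condition such as `a.id = b.id AND a.x > 5`, calling `nullFiltersFor` once with the left child's output and once with the right child's output gives the filters that would sit beneath each side of the join. The rule in this PR derives these filters from the operators' data constraints rather than by pattern matching on the condition as done here.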

How was this patch tested?

1. Added a new NullFilteringSuite that tests for IsNotNull filters in the query plan for joins and filters. It also tests the interaction with the CombineFilters optimizer rules.
2. Tested generated ExpressionTrees via OrcFilterSuite.
3. Tested filter source pushdown logic via SimpleTextHadoopFsRelationSuite.

cc @yhuai @nongli

SparkQA · 2016-02-25T19:11:29Z

Test build #51980 has finished for PR 11372 at commit 06d74da.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-26T01:17:39Z

Test build #51994 has finished for PR 11372 at commit 2345075.

This patch fails from timeout after a configured wait of 250m.
This patch merges cleanly.
This patch adds no public classes.

yhuai · 2016-02-29T18:26:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

I am wondering whether the optimizer is the right place for this rule. My main concern is whether we can preserve this ordering through the rest of query compilation. Would it be better to do it inside the physical Filter operator (just before we start to generate the code)?

sameeragarwal · 2016-02-29

Yes, that sounds like a good idea! Could there be any downside to not doing it in the optimizer? /cc @nongli

SparkQA · 2016-03-02T22:34:02Z

Test build #52338 has finished for PR 11372 at commit 28050b3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- abstract class PlanTest extends SparkFunSuite with PredicateHelper

SparkQA · 2016-03-03T09:35:39Z

Test build #52383 has finished for PR 11372 at commit 2a469e8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-03T21:35:31Z

Test build #52406 has finished for PR 11372 at commit 80dab7e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-04T01:02:03Z

Test build #52416 has finished for PR 11372 at commit 013f97a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

sameeragarwal · 2016-03-04T01:34:00Z

test this please

SparkQA · 2016-03-04T03:37:04Z

Test build #52431 has finished for PR 11372 at commit 013f97a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2016-03-04T04:49:13Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala

 }

+/**
+ * Attempts to eliminate reading (unnecessary) NULL values if they are not required for correctness


"in the query plan"

sameeragarwal · 2016-03-04T23:34:28Z

Thanks @nongli, all comments addressed.

SparkQA · 2016-03-05T01:42:09Z

Test build #52494 has finished for PR 11372 at commit 31b1700.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

nongli · 2016-03-07T19:55:55Z

LGTM

…ns based on their data constraints

## What changes were proposed in this pull request?

This PR adds an optimizer rule to eliminate reading (unnecessary) NULL values if they are not required for correctness by inserting `isNotNull` filters in the query plan. These filters are currently inserted beneath existing `Filter` and `Join` operators and are inferred based on their data constraints.

Note: While this optimization is applicable to all types of join, it primarily benefits `Inner` and `LeftSemi` joins.

## How was this patch tested?

1. Added a new `NullFilteringSuite` that tests for `IsNotNull` filters in the query plan for joins and filters. Also, tests interaction with the `CombineFilters` optimizer rules.
2. Test generated ExpressionTrees via `OrcFilterSuite`
3. Test filter source pushdown logic via `SimpleTextHadoopFsRelationSuite`

cc yhuai nongli

Author: Sameer Agarwal <sameer@databricks.com>

Closes apache#11372 from sameeragarwal/gen-isnotnull.

sameeragarwal force-pushed the gen-isnotnull branch from 06d74da to 2345075 Compare February 25, 2016 21:02

yhuai reviewed Feb 29, 2016

sameeragarwal force-pushed the gen-isnotnull branch from 2345075 to c08b7fb Compare March 2, 2016 21:08

NullFiltering rule in catalyst

cc4323f

sameeragarwal force-pushed the gen-isnotnull branch from c08b7fb to 28050b3 Compare March 2, 2016 21:09

sameeragarwal changed the title ~~[WIP][SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints~~ [SPARK-13495][SQL] Add Null Filters in the query plan for Filters/Joins based on their data constraints Mar 2, 2016

unit tests

28050b3

sameeragarwal added 2 commits March 2, 2016 19:12

Fix OrcFilterSuite

0b1520c

Fix SimpleTextHadoopFsRelationSuite

2a469e8

Add isNotNull handling in SimpleTextRelation

80dab7e

Fix PlannerSuite and ParquetFilterSuite

013f97a

nongli reviewed Mar 4, 2016

Nong's comments

31b1700

asfgit closed this in ef77003 Mar 7, 2016

sameeragarwal mentioned this pull request Mar 9, 2016

[SPARK-13751] [SQL] generate better code for Filter #11585

Closed

This was referenced Mar 11, 2016

[SPARK-13811][SPARK-13836] [SQL] Removed IsNotNull Constraints of Compound Expressions And Generated IsNotNull Constraints inside Not #11649

Closed

[SPARK-13869][SQL] Remove redundant conditions while combining filters #11670

Closed
