[SPARK-24402] [SQL] Optimize `In` expression when only one element in the collection or collection is empty #21797

dbtsai · 2018-07-17T17:38:04Z

What changes were proposed in this pull request?

This PR adds two new rules to the logical plan optimizer.

When there is only one element in the collection, the `In`
expression is rewritten to `EqualTo`, so the physical plan can
use predicate pushdown.

    profileDF.filter( $"profileID".isInCollection(Set(6))).explain(true)
    """
      |== Physical Plan ==
      |*(1) Project [profileID#0]
      |+- *(1) Filter (isnotnull(profileID#0) && (profileID#0 = 6))
      |   +- *(1) FileScan parquet [profileID#0] Batched: true, Format: Parquet,
      |     PartitionFilters: [],
      |     PushedFilters: [IsNotNull(profileID), EqualTo(profileID,6)],
      |     ReadSchema: struct<profileID:int>
    """.stripMargin

When the collection is empty and the input is nullable, the
logical plan will be simplified to:

    profileDF.filter( $"profileID".isInCollection(Set())).explain(true)
    """
      |== Optimized Logical Plan ==
      |Filter if (isnull(profileID#0)) null else false
      |+- Relation[profileID#0] parquet
    """.stripMargin

TODO:

- When the collection has multiple elements but fewer than a
  certain threshold, we should still allow predicate pushdown.
- Optimize `In` using tableswitch or lookupswitch when the
  number of categories is low and they are Int or Long.
- The default immutable hash tree set is slow for lookups; we
  should benchmark different set implementations for faster
  lookups.
- filter(if (condition) null else false) can be optimized to
  false, since a filter treats a null predicate result as false.
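To illustrate the tableswitch/lookupswitch idea from the TODO list, here is a small sketch in plain Scala rather than Spark's generated code. The category values `1, 3, 6` and all names are hypothetical. The `@switch` annotation asks scalac to warn if the match cannot be compiled to a JVM switch instruction; the set-based lookup is the baseline a benchmark would compare against.

```scala
import scala.annotation.switch

object SmallIntIn {
  // Membership test over the hypothetical categories {1, 3, 6},
  // written as a match on integer literals so that scalac can emit
  // a tableswitch or lookupswitch instead of a set lookup.
  def inSmallSet(x: Int): Boolean = (x: @switch) match {
    case 1 => true
    case 3 => true
    case 6 => true
    case _ => false
  }

  // Baseline: lookup in the default immutable Set.
  private val categories: Set[Int] = Set(1, 3, 6)
  def inHashSet(x: Int): Boolean = categories.contains(x)

  def main(args: Array[String]): Unit = {
    // Both implementations must agree on every input.
    assert((0 to 10).forall(x => inSmallSet(x) == inHashSet(x)))
    println("switch-based and set-based lookups agree")
  }
}
```

Whether the switch form is actually faster would still need the benchmark the TODO calls for.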

How was this patch tested?

A couple of new tests are added.

dbtsai · 2018-07-17T20:38:36Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala

+          // TODO: `EqualTo` for structural types are not working. Until SPARK-24443 is addressed,
+          // TODO: we exclude them in this rule.
+          && !v.isInstanceOf[CreateNamedStructLike]
+          && !newList.head.isInstanceOf[CreateNamedStructLike]) {


@cloud-fan @gatorsmile until #21470 is merged, let's exclude CreateNamedStructLike in this rule.

SparkQA · 2018-07-17T20:48:37Z

Test build #93186 has finished for PR 21797 at commit 8bc7573.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-17T22:28:28Z

Test build #93192 has finished for PR 21797 at commit 79d14f1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2018-07-17T22:54:45Z

Test it again.

SparkQA · 2018-07-18T00:19:39Z

Test build #93203 has finished for PR 21797 at commit 68e9f04.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-07-18T00:28:11Z

Test build #93202 has finished for PR 21797 at commit 2e62027.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2018-07-18T00:34:04Z

Merged into master as it passes the build now.

Optimize In

8bc7573

dbtsai force-pushed the optimize-in branch from 1a0cd0b to 8bc7573 on July 17, 2018 17:39

dbtsai added 3 commits July 17, 2018 11:27

Fix the build

79d14f1

Ignore StrucuteType in the new rule

2e62027

Add doc

68e9f04

asfgit closed this in 681845f Jul 18, 2018
