
[SPARK-27033][SQL]Add Optimize rule RewriteArithmeticFiltersOnIntegralColumn #23942


Closed
wants to merge 10 commits

Conversation

WangGuangxin
Contributor

@WangGuangxin WangGuangxin commented Mar 3, 2019

What changes were proposed in this pull request?

Currently, filters like select * from table where a + 1 = 3 cannot be pushed down. This rule rewrites the predicate to select * from table where a = 3 - 1, which other rules then fold into select * from table where a = 2, so that it can be pushed down to Parquet or other file formats.

The supported comparisons are = and !=. The supported operations are Add and Subtract. Only integral types (BYTE, SHORT, INT, and LONG) are supported; FLOAT and DOUBLE are excluded because of precision issues.
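
For illustration, a rough sketch of the intended end-to-end effect (a hedged example; the path, table, and column names below are made up):

// Build a small Parquet table with an INT column `a` (names and path are hypothetical).
spark.range(10).selectExpr("CAST(id AS INT) AS a").write.mode("overwrite").parquet("/tmp/t")
spark.read.parquet("/tmp/t").createOrReplaceTempView("t")

// Without this rule, the predicate `a + 1 = 3` is not pushed into the Parquet scan.
// With this rule (followed by ConstantFolding), it becomes `a = 2` and should appear in the
// scan's PushedFilters, letting Parquet skip row groups.
spark.sql("SELECT * FROM t WHERE a + 1 = 3").explain(true)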

How was this patch tested?

Unit test by RewriteArithmeticFiltersOnIntegralColumnSuite.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Hi, @WangGuangxin. Thank you for your first contribution.

  • First of all, could you create another file for this optimizer?
  • Second, could you rename this optimizer to something more specific? TransformBinaryComparison sounds too broad to me because this optimizer only handles +/- on Int/Long; it cannot handle *, /, or many other data types.

@WangGuangxin WangGuangxin changed the title [SPARK-27033][SQL]Add Optimize rule TransformBinaryComparison [SPARK-27033][SQL]Add Optimize rule RewriteArithmeticFiltersOnIntOrLongColumn Mar 4, 2019
@WangGuangxin
Contributor Author

Hi, @WangGuangxin. Thank you for your first contribution.

  • First of all, could you create another file for this optimizer?
  • Second, could you rename this optimizer to something more specific? TransformBinaryComparison sounds too broad to me because this optimizer only handles +/- on Int/Long; it cannot handle *, /, or many other data types.

I renamed it to RewriteArithmeticFiltersOnIntOrLongColumn and moved it into its own file.

@@ -128,7 +128,8 @@ abstract class Optimizer(sessionCatalog: SessionCatalog)
       RemoveRedundantAliases,
       RemoveNoopOperators,
       SimplifyExtractValueOps,
-      CombineConcats) ++
+      CombineConcats,
+      RewriteArithmeticFiltersOnIntOrLongColumn) ++
Member

Would it be better to put this rule just before ConstantFolding?

Contributor Author

yes, I've put it before ConstantFolding
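
For reference, a rough sketch of the intended ordering inside Optimizer.scala (the surrounding rule list is abridged and the exact neighbors are an assumption):

  val operatorOptimizationRuleSet =
    Seq(
      // ... other operator-optimization rules ...
      RewriteArithmeticFiltersOnIntOrLongColumn, // rewrite i + 3 = 5 into i = 5 - 3 first,
      ConstantFolding,                           // so that 5 - 3 is then folded into 2
      // ... remaining rules ...
      CombineConcats)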

@maropu
Member

maropu commented Mar 4, 2019

ok to test

@SparkQA

SparkQA commented Mar 4, 2019

Test build #102980 has finished for PR 23942 at commit aaf2b8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 4, 2019

Test build #102994 has finished for PR 23942 at commit 82ff2a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangGuangxin
Contributor Author

Test build #102994 has finished for PR 23942 at commit 82ff2a1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

This fails the test org.apache.spark.sql.hive.OptimizeHiveMetadataOnlyQuerySuite "SPARK-23877: filter on projected expression" because it assumes that part + 1 < 5 will not be pushed down, which is exactly what this PR changes. cc @rdblue for confirmation.

@dongjoon-hyun
Member

Please update SPARK-23877: filter on projected expression to use other expressions, @WangGuangxin .

@SparkQA

SparkQA commented Mar 5, 2019

Test build #103016 has finished for PR 23942 at commit 1fe9a41.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 5, 2019

Test build #103024 has finished for PR 23942 at commit 597d6d7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

* SELECT * FROM table WHERE i = 2
* }}}
*/
object RewriteArithmeticFiltersOnIntOrLongColumn extends Rule[LogicalPlan] with PredicateHelper {
Member

Is this restricted to Filter only? It looks like it rewrites all qualifying expressions in any logical plan node.

Contributor Author

Yes, it is Filter only. I've changed it to work only on Filter.

@SparkQA

SparkQA commented Mar 5, 2019

Test build #103042 has finished for PR 23942 at commit 03c522e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangGuangxin
Contributor Author

Please update SPARK-23877: filter on projected expression to use other expressions, @WangGuangxin .

Done. I updated the two related unit tests.

assert(checkNotPushdown(sql("SELECT * FROM foobar WHERE (THEID + 2) != 4")).collect().size == 2)
// SPARK-27033: Add Optimize rule RewriteArithmeticFiltersOnIntOrLongColumn
assert(checkPushdown(sql("SELECT * FROM foobar WHERE (THEID + 1) < 2")).collect().size == 0)
assert(checkPushdown(sql("SELECT * FROM foobar WHERE (THEID + 2) != 4")).collect().size == 2)
Contributor

Is "!=" also supported? The PR description only mentions "=, >=, <=, >, <".

Contributor Author

Yes, it is; I've updated the PR description.

* {{{
* SELECT * FROM table WHERE i = 2
* }}}
*/
Contributor

@SongYadong SongYadong Mar 6, 2019

It would be good to document the supported comparison operators here.

Contributor Author

Thanks for your advice. I have updated the comments here

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f: Filter =>
      f transformExpressionsUp {
        case e @ BinaryComparison(left: BinaryArithmetic, right: Literal)
Contributor

what about checking if it is foldable instead of a Literal?

Contributor Author

Yes, checking foldable is better since it accelerates convergence; I'll change it.
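
A rough sketch of what the relaxed match could look like (not the exact PR code; transformLeft and isDataTypeSafe are helpers from this PR whose exact signatures are assumed here):

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f: Filter =>
      f transformExpressionsUp {
        // Accept any foldable right-hand side instead of only a Literal.
        case e @ BinaryComparison(left: BinaryArithmetic, right: Expression)
            if right.foldable && isDataTypeSafe(left.dataType) =>
          transformLeft(e, left, right)
      }
  }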

@SparkQA

SparkQA commented Mar 6, 2019

Test build #103090 has finished for PR 23942 at commit 0f61953.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
}

private def isDataTypeSafe(dataType: DataType): Boolean = dataType match {
Contributor

Why are only integers and longs accepted?

Contributor Author

@WangGuangxin WangGuangxin Mar 7, 2019

Float and Double have precision issues. For example, a + 3.2 < 4.0 would be converted to a < 0.7999999999999998.

Member

How about the other integral types, e.g., short?

Contributor Author

Yes, the integral types (byte, short, int, long) are all OK. I'll add support for the byte and short types as well.
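
A minimal sketch of the widened type check (the helper name comes from the diff above; the exact body in the PR may differ):

  private def isDataTypeSafe(dataType: DataType): Boolean = dataType match {
    // All integral types are safe; FloatType and DoubleType are excluded for precision reasons.
    case ByteType | ShortType | IntegerType | LongType => true
    case _ => false
  }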

Contributor

I think the precision issue would be there anyway, when executed at runtime, am I wrong?

Contributor Author

I think the precision issue would be there anyway, when executed at runtime, am I wrong?

I ran a simple test on a table with a Double-type column a that has two records: 0.7999999999999998 and 0.8.

With a + 3.2 = 4.0, it returns both records. But if we optimize it to a = 0.7999999999999998, the result will be wrong.

Contributor

Here is an example:

scala> spark.sql("select float(1E-8) + float(1E+10) <= float(1E+10)").show()
+----------------------------------------------------------------------+
|((CAST(1E-8 AS FLOAT) + CAST(1E+10 AS FLOAT)) <= CAST(1E+10 AS FLOAT))|
+----------------------------------------------------------------------+
|                                                                  true|
+----------------------------------------------------------------------+


scala> spark.sql("select float(1E-8) <= float(1E+10) - float(1E+10)").show()
+----------------------------------------------------------------------+
|(CAST(1E-8 AS FLOAT) <= (CAST(1E+10 AS FLOAT) - CAST(1E+10 AS FLOAT)))|
+----------------------------------------------------------------------+
|                                                                 false|
+----------------------------------------------------------------------+

Although float(1E-8) + float(1E+10) <= float(1E+10) should return false, it evaluates to true, while the rewritten form evaluates to false. This may lead to inconsistency.

Contributor

thanks for the explanation

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case f: Filter =>
      f transformExpressionsUp {
        case e @ BinaryComparison(left: BinaryArithmetic, right: Expression)
Contributor

is it safe to do it also for non-deterministic expressions?

Contributor Author

@WangGuangxin WangGuangxin Mar 7, 2019

Checking foldable is enough, because ConstantFolding will convert all foldable expressions to Literals. And in fact, non-deterministic expressions are not foldable.

Contributor

Yes, but what if the remaining part of left is non-deterministic?

Contributor Author

There is a check in transformRight and transformLeft to make sure the other part of BinaryArithmetic is foldable
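
For illustration, a hedged sketch of how such a guard could look inside transformLeft (the helper name comes from the comment above; the body and signature are assumptions, not the PR's exact code):

  private def transformLeft(e: BinaryComparison, left: BinaryArithmetic, right: Expression)
      : Expression = left match {
    // Rewrite only when the non-attribute operand is foldable, so a non-deterministic
    // sub-expression on the left side can never be moved to the right side.
    case Add(ar: AttributeReference, other) if other.foldable =>
      e.makeCopy(Array(ar, Subtract(right, other)))  // ar + c  cmp  r  ==>  ar  cmp  r - c
    case Subtract(ar: AttributeReference, other) if other.foldable =>
      e.makeCopy(Array(ar, Add(right, other)))       // ar - c  cmp  r  ==>  ar  cmp  r + c
    case _ => e
  }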

Contributor

yes, I was thinking about the AttributeReference, but it is always deterministic. So I think it is fine, thanks.

@@ -65,11 +65,11 @@ class OptimizeHiveMetadataOnlyQuerySuite extends QueryTest with TestHiveSingleton

     // verify the matching partitions
     val partitions = spark.internalCreateDataFrame(Distinct(Filter(($"x" < 5).expr,
-      Project(Seq(($"part" + 1).as("x").expr.asInstanceOf[NamedExpression]),
+      Project(Seq(($"part" * 1).as("x").expr.asInstanceOf[NamedExpression]),
Contributor

why do we need this?

Contributor Author

Because with the optimizer in this PR, part + 1 < 5 will be optimized to part < 4, where part is a partition column, so it only needs to fetch 4 partitions instead of 11, and the last assertion assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount - startCount == 11) will fail.
From the comment in this test, it wants to verify that the partition predicate was not pushed down to the metastore, so I changed it to part * 1, which will not be optimized.

Member

How about using the spark.sql.optimizer.excludedRules config instead of changing the existing tests?

Contributor Author

How about using the spark.sql.optimizer.excludedRules config instead of changing the existing tests?

Thanks for your advice. I've made the change accordingly.
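
For reference, a hedged sketch of what that could look like in the test (withSQLConf and spark.sql.optimizer.excludedRules are existing APIs; the rule's fully-qualified name and the surrounding assertions are assumed/abridged):

  withSQLConf(SQLConf.OPTIMIZER_EXCLUDED_RULES.key ->
      "org.apache.spark.sql.catalyst.optimizer.RewriteArithmeticFiltersOnIntOrLongColumn") {
    // With the rule excluded, `part + 1 < 5` is left as-is and is not pushed to the metastore,
    // so the original assertions on HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED still hold.
  }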

@SparkQA

SparkQA commented Mar 6, 2019

Test build #103096 has finished for PR 23942 at commit 5d0a5e8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Rewrite arithmetic filters on int or long column to its equivalent form,
* leaving attribute alone in one side, so that we can push it down to
* parquet or other file format.
Member

@maropu maropu Mar 7, 2019

nit: how about this?

/**
 * Rewrite arithmetic filters on an integral-type (e.g., int and long) column to its equivalent
 * form, leaving attribute alone in a left side, so that we can push it down to
 * datasources (e.g., Parquet and ORC).
 *
 * For example, this rule can optimize a query as follows:
 * {{{
 *   SELECT * FROM table WHERE i + 3 = 5
 *   ==> SELECT * FROM table WHERE i = 5 - 3
 * }}}
 *
 * Then, the [[ConstantFolding]] rule will further optimize it as follows:
 * {{{
 *   SELECT * FROM table WHERE i = 2
 * }}}
 *
 * Note:
 * 1. This rule supports `Add` and `Subtract` in arithmetic expressions.
 * 2. This rule supports `=`, `>=`, `<=`, `>`, `<`, and `!=` in comparators.
 * 3. This rule supports `INT` and `LONG` types only. It doesn't support `FLOAT` or `DOUBLE`
 *    because of precision issues.
 */

Contributor Author

That's clearer. I've updated it.

@WangGuangxin WangGuangxin changed the title [SPARK-27033][SQL]Add Optimize rule RewriteArithmeticFiltersOnIntOrLongColumn [SPARK-27033][SQL]Add Optimize rule RewriteArithmeticFiltersOnIntegralColumn Mar 8, 2019
@SparkQA

SparkQA commented Mar 8, 2019

Test build #103189 has finished for PR 23942 at commit 3927dec.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 8, 2019

Test build #103197 has finished for PR 23942 at commit 2c02777.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Contributor

retest this please

@maropu
Member

maropu commented Mar 8, 2019

How do you handle this behaviour change?

// v2.4.0
scala> Seq(0, Int.MaxValue).toDF("v").write.saveAsTable("t")
scala> sql("select * from t").show
+----------+
|         v|
+----------+
|         0|
|2147483647|
+----------+

scala> sql("select * from t where v + 1 > 0").show
+---+
|  v|
+---+
|  0|
+---+

// this pr
scala> sql("select * from t where v + 1 > 0").show
+----------+
|         v|
+----------+
|         0|
|2147483647|
+----------+

}
}

private def isAddSafe[T](left: Any, right: Any, minValue: T, maxValue: T)(
Contributor

I don't see the need for this one and the next. As of now, we are not handling overflows with integers (you can see #21599 is still open), so I think we can get rid of these checks. It may be worth, though, adding a comment (like a TODO) as a reminder that this issue can arise.

Contributor

In some cases, it doesn't necessarily cause overflow if we don't rewrite it, so there's potential inconsistency again.

Contributor Author

The check here is to make sure that, if overflow could occur after the rewrite, the expression will not be rewritten.
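
A hedged sketch of the intent behind such a guard (simplified to Long bounds; the PR's actual isAddSafe is generic over the integral types):

  // Skip the rewrite when folding the constants could overflow the column's type bounds.
  private def isAddSafe(left: Long, right: Long, minValue: Long, maxValue: Long): Boolean = {
    if (right > 0) left <= maxValue - right else left >= minValue - right
  }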

@SparkQA

SparkQA commented Mar 8, 2019

Test build #103201 has finished for PR 23942 at commit 2c02777.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WangGuangxin
Contributor Author

How do you handle this behaviour change?

// v2.4.0
scala> Seq(0, Int.MaxValue).toDF("v").write.saveAsTable("t")
scala> sql("select * from t").show
+----------+
|         v|
+----------+
|         0|
|2147483647|
+----------+

scala> sql("select * from t where v + 1 > 0").show
+---+
|  v|
+---+
|  0|
+---+

// this pr
scala> sql("select * from t where v + 1 > 0").show
+----------+
|         v|
+----------+
|         0|
|2147483647|
+----------+

This is a bad case that I didn't think about before. I found there are four kinds of cases:

  • v + 1 > 0 => v > -1 and v <= Int.MAX - 1
  • v - 1 > 0 => v > 1 or (v < Int.MIN + 1 && v > 0 - 1 + Int.MIN - Int.MAX )
  • v + 1 < 0 => v < -1 or (v > Int.MAX -1 && v < 0 - 1 + Int.MAX - Int.MIN)
  • v - 1 < 0 => v < 1 and v >= Int.MIN + 1

For one inequality, the rewrite may require two or three inequalities, which makes the expressions much more complex. So I don't think it's worth converting inequalities. We may only handle = or != here. What do you think?

@maropu
Member

maropu commented Mar 11, 2019

I'm neutral on this, but I feel there are not many queries that this rule could optimize... WDYT? cc: @cloud-fan

@cloud-fan
Contributor

I think it's hard to rewrite the comparison expressions to match the overflow behavior. What's worse, the overflow behavior may change in the future to follow the SQL standard and throw an exception when overflow happens.

Only handling equal SGTM.
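
If the rule is narrowed that way, a hedged sketch of the equality-only match could look like this (helper names follow the earlier sketches; since Catalyst parses a != b as Not(EqualTo(a, b)), the same case also covers != under transformExpressionsUp):

        case e @ EqualTo(left: BinaryArithmetic, right: Expression)
            if right.foldable && isDataTypeSafe(left.dataType) =>
          transformLeft(e, left, right)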

@WangGuangxin
Contributor Author

retest it please

@SparkQA

SparkQA commented Mar 15, 2019

Test build #103542 has finished for PR 23942 at commit 43892ac.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Rewrite arithmetic filters on an integral-type (e.g., byte, short, int and long)
* column to its equivalent form, leaving attribute alone in a left side, so that
* we can push it down to datasources (e.g., Parquet and ORC).
Member

cc @liancheng per #8165

@AmplabJenkins

Can one of the admins verify this patch?

@github-actions

github-actions bot commented Jan 1, 2020

We're closing this PR because it hasn't been updated in a while.
This isn't a judgement on the merit of the PR in any way. It's just
a way of keeping the PR queue manageable.

If you'd like to revive this PR, please reopen it!

@github-actions github-actions bot added the Stale label Jan 1, 2020
@github-actions github-actions bot closed this Jan 2, 2020