[SPARK-40903][SQL] Avoid reordering decimal Add for canonicalization if data type is changed #38379
Conversation
I confirmed that the regression is caused by the refactoring PR #36698. Before the refactor, the query looked like:
All the children of `Add` are cast to the final data type, so reordering `Add` for canonicalization doesn't matter.
```diff
@@ -477,7 +477,10 @@ case class Add(
   override protected def withNewChildrenInternal(newLeft: Expression, newRight: Expression): Add =
     copy(left = newLeft, right = newRight)

-  override lazy val canonicalized: Expression = {
+  override lazy val canonicalized: Expression = dataType match {
+    case _: DecimalType =>
```
can we add some comments to explain the reason?
thank you @gengliangwang for catching this.
can we make it more fine-grained? Not all decimal adds will fail, so we can check whether we can reorder them safely, e.g., when the precision and scale of all the left and right operands are the same.
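The fine-grained criterion suggested here, reordering only when it cannot change the result type, can be illustrated with a small Python model of the decimal addition rule described later in this PR (hypothetical helper names; this is a sketch, not Spark code, and it ignores Spark's precision cap of 38):

```python
from functools import reduce
from itertools import permutations

def add_result_type(t1, t2):
    # Result type of decimal(p1,s1) + decimal(p2,s2), per the rule in the PR
    # description: precision = max(p1-s1, p2-s2) + 1 + max(s1, s2), scale = max(s1, s2).
    (p1, s1), (p2, s2) = t1, t2
    scale = max(s1, s2)
    return (max(p1 - s1, p2 - s2) + 1 + scale, scale)

def order_invariant(operand_types):
    """True if every left-to-right fold order yields the same result type."""
    results = {reduce(add_result_type, perm) for perm in permutations(operand_types)}
    return len(results) == 1

# Operands with identical precision/scale fold to the same type in any order,
# so reordering them for canonicalization is safe.
print(order_invariant([(12, 5), (12, 5), (12, 5)]))  # True
# Mixed decimal types can fold to different types depending on order.
print(order_invariant([(12, 5), (12, 6), (3, 2)]))   # False
```

With identical operand types, the folded type is the same for every ordering, which is exactly the "safe to reorder" case the comment proposes.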
@cloud-fan comment added
@ulysses-you Are you sure about that? My concern is that if both left and right contain decimal `Add`s, the result may still be different after sorting all the sub-`Add`s.
How about adding an extra `Cast` into the canonicalized form if needed, like:

```scala
override lazy val canonicalized: Expression = {
  // TODO: do not reorder consecutive `Add`s with different `evalMode`
  val reordered = orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
  if (dataType != reordered.dataType) {
    Cast(reordered, dataType)
  } else {
    reordered
  }
}
```
Hmm, maybe adding an extra `Cast` is not a good idea, as the 2 expressions with different `dataType`s shouldn't be considered equal. But if `reordered`'s data type matches the original, then why can't we reorder?
@peter-toth The ideal solution would be adding extra casts in the canonicalization of the children of every `ComplexTypeMergingExpression` if the data type is changed. However, some of the `ComplexTypeMergingExpression`s override the canonicalization.
So I will take your suggestion to reorder the `Add` only if the result data type is not changed. Thank you.
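The approach settled on here, sorting the operands only when the reordered expression keeps the original result type, can be sketched as a Python model (hypothetical names; the real fix lives in Spark's `Add.canonicalized`, and this uses plain sorting as a stand-in for Spark's hash-based ordering):

```python
from functools import reduce

def add_result_type(t1, t2):
    # decimal(p1,s1) + decimal(p2,s2) -> decimal(max(p1-s1,p2-s2)+1+max(s1,s2), max(s1,s2))
    (p1, s1), (p2, s2) = t1, t2
    scale = max(s1, s2)
    return (max(p1 - s1, p2 - s2) + 1 + scale, scale)

def canonical_order(operand_types):
    """Sort the operands for canonicalization only if the fold over the sorted
    order yields the same result type as the original order."""
    original_type = reduce(add_result_type, operand_types)
    reordered = sorted(operand_types)  # stand-in for the canonical ordering
    if reduce(add_result_type, reordered) == original_type:
        return reordered
    return list(operand_types)  # reordering would change the type: keep as-is

# (12,6) + (3,2) + (12,5) folds to (14,6), but the sorted order folds to (15,6),
# so the original order is kept.
print(canonical_order([(12, 6), (3, 2), (12, 5)]))
# (12,5) + (12,6) + (3,2) folds to (15,6) either way, so sorting is allowed.
print(canonical_order([(12, 5), (12, 6), (3, 2)]))
```

This mirrors the condition in the final patch: canonicalize with the reordered form only when its resolved data type equals the original one.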
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
```diff
-      orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
+      val reorderResult =
+        orderCommutative({ case Add(l, r, _) => Seq(l, r) }).reduce(Add(_, _, evalMode))
+      if (resolved && reorderResult.resolved && reorderResult.dataType == dataType) {
```
Not a big concern, but there is the cost of re-calculating the data type. I'm fine with this.
Merging to master. @cloud-fan @peter-toth @ulysses-you thanks for the review.
[SPARK-40903][SQL] Avoid reordering decimal Add for canonicalization if data type is changed

### What changes were proposed in this pull request?

Avoid reordering `Add` for canonicalization if it is of decimal type and the result data type would change.

Expressions are canonicalized for comparisons and explanations. For a non-decimal `Add` expression, the children can be sorted by hash code and the result is the same. However, an `Add` expression of decimal type behaves differently: given decimal (p1, s1) and decimal (p2, s2), the integral part of the result is `max(p1-s1, p2-s2) + 1` and the fractional part is `max(s1, s2)`, so the result data type is `decimal(max(p1-s1, p2-s2) + 1 + max(s1, s2), max(s1, s2))`. Thus the order matters:

- For `(decimal(12,5) + decimal(12,6)) + decimal(3, 2)`, the first add `decimal(12,5) + decimal(12,6)` results in `decimal(14, 6)`, and then `decimal(14, 6) + decimal(3, 2)` results in `decimal(15, 6)`.
- For `(decimal(12, 6) + decimal(3,2)) + decimal(12, 5)`, the first add `decimal(12, 6) + decimal(3,2)` results in `decimal(13, 6)`, and then `decimal(13, 6) + decimal(12, 5)` results in `decimal(14, 6)`.

In the following query:

```
create table foo(a decimal(12, 5), b decimal(12, 6)) using orc
select sum(coalesce(a+b+ 1.75, a)) from foo
```

`coalesce(a+b+ 1.75, a)` is first resolved as `coalesce(a+b+ 1.75, cast(a as decimal(15, 6)))`. In the canonicalized version, the expression becomes `coalesce(1.75+b+a, cast(a as decimal(15, 6)))`. As explained above, `1.75+b+a` is of `decimal(14, 6)`, which is different from `cast(a as decimal(15, 6))`. Thus the following error happens:

```
java.lang.IllegalArgumentException: requirement failed: All input types must be the same except nullable, containsNull, valueContainsNull flags.
The input types found are
	DecimalType(14,6)
	DecimalType(15,6)
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1149)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1143)
```

This PR fixes the bug.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new test case.

Closes apache#38379 from gengliangwang/fixDecimalAdd.

Lead-authored-by: Gengliang Wang <gengliang@apache.org>
Co-authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
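The order dependence of the decimal result type described above can be checked with a few lines of Python (a model of the rule as stated in the PR description, with hypothetical names; real Spark additionally caps precision at 38, which is ignored here):

```python
def add_result_type(t1, t2):
    """Result type of decimal(p1, s1) + decimal(p2, s2):
    integral part max(p1-s1, p2-s2) + 1, fractional part max(s1, s2)."""
    (p1, s1), (p2, s2) = t1, t2
    scale = max(s1, s2)
    return (max(p1 - s1, p2 - s2) + 1 + scale, scale)

# (decimal(12,5) + decimal(12,6)) + decimal(3,2) -> decimal(15,6)
left_first = add_result_type(add_result_type((12, 5), (12, 6)), (3, 2))
# (decimal(12,6) + decimal(3,2)) + decimal(12,5) -> decimal(14,6)
reordered = add_result_type(add_result_type((12, 6), (3, 2)), (12, 5))
print(left_first, reordered)  # (15, 6) (14, 6)
```

The two orderings fold to `decimal(15, 6)` and `decimal(14, 6)` respectively, which is exactly the mismatch behind the `IllegalArgumentException` in the reported query.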