
Conversation

@skambha (Contributor) commented Feb 18, 2020

What changes were proposed in this pull request?

JIRA SPARK-28067: Wrong results are returned for aggregate sum with decimals with whole stage codegen enabled

Repro:
WholeStage codegen enabled -> Wrong results
WholeStage codegen disabled -> Throws an exception: Decimal precision 39 exceeds max precision 38

Issues:

  1. Wrong results are returned, which is a correctness issue.
  2. Inconsistent behavior between whole stage codegen enabled and disabled.

Cause:
Sum does not account for the possibility of overflow in the intermediate steps, i.e. the updateExpressions and mergeExpressions.

This PR makes the following changes:

  • Throw an exception if there is a decimal overflow when computing the sum (a rough sketch of the idea follows this list).
  • This will be consistent with how Spark behaves when whole stage codegen is disabled.
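
A minimal, self-contained sketch of the intent (not the actual Catalyst diff): the helper name addChecked, the use of plain java.math.BigDecimal, and the hard-coded precision limit of 38 (Spark's DecimalType maximum) are assumptions made only for illustration.

import java.math.{BigDecimal => JBigDecimal}

// Illustrative only: fail fast when the running sum no longer fits the result
// type, instead of silently producing a wrong value at the end.
def addChecked(buffer: JBigDecimal, input: JBigDecimal, maxPrecision: Int = 38): JBigDecimal = {
  val sum = buffer.add(input)
  if (sum.precision() > maxPrecision) {
    throw new ArithmeticException(
      s"$sum cannot be represented as Decimal($maxPrecision, ${sum.scale()})")
  }
  sum
}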

Pros:

  • No wrong results
  • Consistent behavior between wholestage enabled and disabled
  • DBs have similar behavior, so there is precedent

Before Fix: - WRONG RESULTS

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
+---------------------------------------+                                       
|sum(decNum)                            |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+

After fix:

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
20/02/18 13:36:19 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 9)    
java.lang.ArithmeticException: Decimal(expanded,100000000000000000000.000000000000000000,39,18}) cannot be represented as Decimal(38, 18).

Why are the changes needed?

The changes are needed in order to fix the wrong results that are returned for decimal aggregate sum.

Does this PR introduce any user-facing change?

Prior to this fix, the user would see wrong results for an aggregate sum that involved decimal overflow; now the user will see an exception. This behavior is also consistent with how Spark behaves when whole stage codegen is disabled.

How was this patch tested?

A new test has been added, and the existing sql, catalyst and hive test suites pass.

…overflow, throw exception and make it consistent to when wholestage codegen is disabled. Also fix the affected test from spark-28224
@AmplabJenkins

Can one of the admins verify this patch?

@skambha (Contributor, Author) commented Feb 18, 2020

Please see my notes in the JIRA for the two approaches to fix this issue. This is an implementation of the approach 1 fix. It is simpler and more straightforward than the approach 2 PR.

I have another PR, #27627, that takes approach 2 to fix this issue. Both will fix the incorrect results (which is good). Each has its pros and cons, as listed in my comment in the JIRA.

@skambha (Contributor, Author) left a review comment

SPARK-28224 only partially took care of decimal overflow for sum, for the case of two values. In the test case that was added as part of SPARK-28224, if you add another row to the dataset, you will get incorrect results instead of a null on overflow.

In this PR we address decimal overflow in aggregate sum by throwing an exception; hence this test has been modified.

Seq("true", "false").foreach { codegenEnabled =>
withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegenEnabled)) {
val structDf = largeDecimals.select("a").agg(sum("a"))
if (!ansiEnabled) {
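      // Sketch only (not the actual test diff): with the fix, the overflow now
      // surfaces when the action runs, so the non-ANSI branch could assert on
      // the error instead of expecting a silent wrong result. intercept comes
      // from ScalaTest; the message check is an assumption about how the task
      // failure is reported to the driver.
      val err = intercept[Exception] { structDf.collect() }
      assert(err.getMessage.contains("cannot be represented as Decimal"))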

@HyukjinKwon (Member) commented

cc @mgaido91

@mgaido91 (Contributor) commented

This PR would introduce regressions. Checking every sum means that temporary overflows would cause an exception. E.g., if you sum MAX_INT, 10, -100, then MAX_INT + 10 would cause the exception. In the current code, this sum is handled properly and returns the correct result, because the temporary overflow is fixed by summing -100. So we would raise exceptions even when not needed. IIRC, other DBs treat this properly, so temporary overflows don't cause exceptions.

The proper fix for this would be to use a larger data type than the returned one as the buffer. I remember I had a PR for that (#25347). You can check its comments and history.
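
To make the temporary-overflow example above concrete, here is a small, self-contained Scala illustration; the values come from the comment, and Math.addExact merely stands in for a per-step overflow check (it is not what Spark uses):

import scala.util.Try

val xs = Seq(Int.MaxValue, 10, -100)

// With a wider (Long) accumulator the intermediate value never overflows,
// and the final result fits back into an Int:
val widerBuffer = xs.map(_.toLong).sum.toInt          // 2147483557

// Checking every step against the narrow type fails on the temporary
// overflow at Int.MaxValue + 10, even though the final sum would fit:
val perStepCheck = Try(xs.foldLeft(0)((acc, x) => Math.addExact(acc, x)))
// perStepCheck is a Failure wrapping java.lang.ArithmeticException: integer overflow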

@skambha (Contributor, Author) commented Feb 21, 2020

> This PR would introduce regressions. Checking every sum means that temporary overflows would cause an exception. E.g., if you sum MAX_INT, 10, -100, then MAX_INT + 10 would cause the exception. In the current code, this sum is handled properly and returns the correct result, because the temporary overflow is fixed by summing -100. So we would raise exceptions even when not needed. IIRC, other DBs treat this properly, so temporary overflows don't cause exceptions.

I see what you are saying, but this PR targets only the aggregate sum of the decimal type (where the result type is decimal), not int or long. Sum of ints is handled the same way as before and does not introduce any regressions for the above-mentioned use case. [1]

This PR is trying to handle the use case regarding aggregate Sum for decimal:

  • Sum of decimal type overflows and returns wrong results.
  • Note: even without this PR, the same sum on a decimal type throws an exception when whole stage codegen is disabled.

(Furthermore, even if spark.sql.ansi.enabled is set to true, we do not return null. This conf property is to ensure that any overflows will return null.)

Here, we are dealing with a correctness issue. This PR's approach is to follow the result returned by the whole-stage-codegen-disabled codepath.

Actually, this issue is mentioned in the PR for SPARK-23179 [3] as a special case; SPARK-28224 partially addressed it.

fwiw, I checked this on MS SQL Server and it throws an error as well. [2]

> The proper fix for this would be to use a larger data type than the returned one as the buffer. I remember I had a PR for that (#25347). You can check its comments and history.

Sure. I checked #25347; it deals with widening the data type for the aggregate sum of longs to decimal to avoid temporary overflow. The decision was not to make that change because a) it is not a correctness issue, b) of the performance hit, and c) a workaround exists: if the user sees an exception because of temporary overflow, they can cast the column to a decimal. [4] An illustration of that workaround is sketched below.
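
A hypothetical illustration of that workaround (the DataFrame df and the column name longCol are invented for the example; decimal(38,0) is just one choice of wider type):

import org.apache.spark.sql.functions.{col, sum}

// df and "longCol" are hypothetical; casting to a wide decimal gives the sum
// buffer enough headroom to absorb temporary overflows of the long type.
val total = df.agg(sum(col("longCol").cast("decimal(38,0)")))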

[1] SPARK-26218: Overflow on arithmetic operations returns incorrect result
[2] http://sqlfiddle.com/#!18/e7ecc/1
[3] SPARK-23179: Support option to throw exception if overflow occurs during Decimal arithmetic
[4] #25347 (comment)

Thanks for your comments.

@skambha (Contributor, Author) commented Feb 21, 2020

@mgaido91, since you worked on a lot of the overflow issues, I'd appreciate it if you could review the two approaches listed in SPARK-28067 and add your thoughts. Thanks.

@mgaido91 (Contributor) commented

Well, in this PR you are changing the logical plan; it's weird that the two execution modes return different results and we have to fix the plan for this.

@github-actions bot commented Jun 8, 2020

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions bot added the Stale label on Jun 8, 2020
@skambha (Contributor, Author) commented Jun 8, 2020

Closing this in favor of the other approach in #27627, which got merged into trunk.

@skambha closed this on Jun 8, 2020