
[SPARK-32018][SQL][FollowUp][3.0] Throw exception on decimal value overflow of sum aggregation #29404


Conversation

gengliangwang
Member

What changes were proposed in this pull request?

This is a followup of #29125
In branch 3.0:

  1. For hash aggregation, before [SPARK-32018][SQL][3.0] UnsafeRow.setDecimal should set null with overflowed value #29125, decimal overflow in sum aggregation caused a runtime exception; after #29125, it can silently produce a wrong result.
  2. For sort aggregation, with or without #29125, decimal overflow can produce a wrong result.

In the master branch (the future 3.1 release), the problem doesn't exist, since #27627 added a flag to the aggregation buffer for marking whether overflow has happened. However, the aggregation buffer is written into streaming checkpoints, so we can't change the aggregation buffer to resolve the issue in branch 3.0.

As there is no easy way in branch 3.0 to choose between returning null and throwing an exception on overflow based on spark.sql.ansi.enabled, we have to make a choice here: always throw an exception on decimal value overflow of sum aggregation.
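
To make the user-facing effect concrete, here is a minimal, hypothetical illustration (the inline table and column name are made up, not from this PR); both input values fit DECIMAL(38, 18) individually, but their sum does not:

```scala
// Run in spark-shell against branch-3.0 with this patch applied.
// Before the patch, the overflowed sum could silently come back wrong;
// with the patch it fails with java.lang.ArithmeticException.
spark.sql("""
  SELECT sum(CAST(v AS DECIMAL(38, 18))) AS total
  FROM VALUES (99999999999999999999.1), (99999999999999999999.1) AS t(v)
""").show()
```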

Why are the changes needed?

Avoid returning wrong result in decimal value sum aggregation.

Does this PR introduce any user-facing change?

Yes, an exception is now always thrown on decimal value overflow of sum aggregation, instead of a possible wrong result.

How was this patch tested?

Unit test case


@SparkQA

SparkQA commented Aug 11, 2020

Test build #127309 has finished for PR 29404 at commit 0a23279.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2020

Test build #127311 has finished for PR 29404 at commit f21f1a0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2020

Test build #127316 has finished for PR 29404 at commit a8be9e1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 11, 2020

Test build #127313 has finished for PR 29404 at commit b9af4f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

This adds perf overhead as we need to check overflow after each Add operation, while the master branch only checks overflow at the end because we have an extra agg buffer slot.

I think this perf overhead is necessary to avoid this correctness bug in 3.0/2.4, but I'm open to other opinions. cc @skambha @dongjoon-hyun @viirya @maropu
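
Roughly, the trade-off looks like this; a simplified, self-contained sketch (the `checkOverflow` helper and `SumBuffer` shape are stand-ins for illustration, not Spark's actual Sum implementation):

```scala
// Stand-in for Spark's Decimal.toPrecision check: fail when the running
// total no longer fits the result type's precision (38 digits here).
def checkOverflow(v: BigDecimal): BigDecimal =
  if (v.precision > 38)
    throw new ArithmeticException(s"$v cannot be represented as Decimal(38, 18)")
  else v

// Branch-3.0 with this patch: validate after every Add, i.e. once per input row.
def updatePerRow(runningSum: BigDecimal, input: BigDecimal): BigDecimal =
  checkOverflow(runningSum + input)

// Master (3.1): the aggregation buffer carries an extra slot, so intermediate
// values need not be validated and the check happens once at evaluation time.
final case class SumBuffer(sum: BigDecimal, isEmpty: Boolean)

def evaluateAtEnd(buffer: SumBuffer): Option[BigDecimal] =
  if (buffer.isEmpty) None else Some(checkOverflow(buffer.sum))
```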

@viirya
Member

viirya commented Aug 12, 2020

However, the aggregation buffer is written into streaming checkpoints, so we can't change the aggregation buffer to resolve the issue.

Is this saying the isEmpty in Sum cannot be backported to branch-3.0?

@viirya
Member

viirya commented Aug 12, 2020

And it looks like we don't have many choices. I think correctness should be considered first.

@cloud-fan
Contributor

Is this saying the isEmpty in Sum cannot be backported to branch-3.0?

I think so. 3.0 doesn't even have mechanisms to detect incompatible state store format, and it may be too much work to backport the bug fix and the streaming state store checks.
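
For context, the reason the master-branch fix is hard to backport is that it widens the Sum aggregation buffer, and that buffer layout is exactly what streaming checkpoints persist. A rough sketch of the two layouts (using catalyst's AttributeReference purely for illustration; this is not the actual Sum code):

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{BooleanType, DecimalType}

// branch-3.0: a single buffer slot holding the running total.
val bufferV30 = Seq(AttributeReference("sum", DecimalType(38, 18))())

// master (3.1) after #27627: an extra isEmpty slot lets Sum tell
// "no input rows" apart from "overflowed to null", so overflow can be
// detected once at the end.
val bufferV31 = Seq(
  AttributeReference("sum", DecimalType(38, 18))(),
  AttributeReference("isEmpty", BooleanType)())

// Streaming state stores persist rows in the buffer layout, so a query
// restarted from an old one-slot checkpoint could not be read back with
// the new two-slot schema without extra compatibility checks.
```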

@maropu
Member

maropu commented Aug 12, 2020

And it looks like we don't have many choices. I think correctness should be considered first.

I have the same opinion as @viirya, and I think the overhead is inevitable.

@dongjoon-hyun
Member

dongjoon-hyun commented Aug 12, 2020

Thank you for pinging me, @gengliangwang and @cloud-fan .

Member

@dongjoon-hyun left a comment


+1, LGTM.

@cloud-fan
Contributor

cloud-fan commented Aug 13, 2020

thanks, merging to 3.0!

cloud-fan pushed a commit that referenced this pull request Aug 13, 2020
[SPARK-32018][SQL][FollowUp][3.0] Throw exception on decimal value overflow of sum aggregation

### What changes were proposed in this pull request?

This is a followup of #29125
In branch 3.0:
1. For hash aggregation, before #29125, decimal overflow in sum aggregation caused a runtime exception; after #29125, it can silently produce a wrong result.
2. For sort aggregation, with or without #29125, decimal overflow can produce a wrong result.

In the master branch (the future 3.1 release), the problem doesn't exist, since #27627 added a flag to the aggregation buffer for marking whether overflow has happened. However, the aggregation buffer is written into streaming checkpoints, so we can't change the aggregation buffer to resolve the issue in branch 3.0.

As there is no easy way in branch 3.0 to choose between returning null and throwing an exception on overflow based on `spark.sql.ansi.enabled`, we have to make a choice here: always throw an exception on decimal value overflow of sum aggregation.

### Why are the changes needed?

Avoid returning wrong result in decimal value sum aggregation.

### Does this PR introduce _any_ user-facing change?

Yes, an exception is now always thrown on decimal value overflow of sum aggregation, instead of a possible wrong result.

### How was this patch tested?

Unit test case

Closes #29404 from gengliangwang/fixSum.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@cloud-fan cloud-fan closed this Aug 13, 2020
@dongjoon-hyun
Member

Thank you all!

@maropu
Member

maropu commented Aug 16, 2020

It seems this commit caused a valid test failure in DataFrameSuite:

[info] - SPARK-28224: Aggregate sum big decimal overflow *** FAILED *** (384 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 77.0 failed 1 times, most recent failure: Lost task 0.0 in stage 77.0 (TID 197, 192.168.11.10, executor driver): java.lang.ArithmeticException: Decimal(expanded,111111111111111111110.246000000000000000,39,18}) cannot be represented as Decimal(38, 18).
[info] 	at org.apache.spark.sql.types.Decimal.toPrecision(Decimal.scala:369)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregate_sum_0$(Unknown Source)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown Source)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown Source)
[info] 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
[info] 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3/533/

I've checked that the test fails with this commit and passes without it in my local env.
@gengliangwang Could you check the failure?

@gengliangwang
Member Author

gengliangwang commented Aug 17, 2020

@maropu Thanks for reporting. I have created #29448 to fix the test failure.
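
That fix presumably updates the test to expect the overflow error. A hypothetical sketch of such an expectation (names and values are illustrative, not the actual change in #29448):

```scala
import org.scalatest.Assertions._
import org.apache.spark.sql.functions.{lit, sum}
import org.apache.spark.sql.types.DecimalType

// Two near-max decimals whose sum no longer fits Decimal(38, 18).
val df = spark.range(2)
  .select(lit(BigDecimal("55555555555555555555.123"))
    .cast(DecimalType(38, 18)).as("decNum"))

// With this commit the overflow surfaces at runtime, so the assertion
// changes from checking a result to intercepting the failure.
val e = intercept[Exception] { df.select(sum("decNum")).collect() }
assert(e.getMessage.contains("cannot be represented as Decimal(38, 18)"))
```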

@cloud-fan
Contributor

Ah, it's branch-3.0, so the GitHub Actions workflow doesn't run there. @HyukjinKwon, how hard is it to get GitHub Actions running on branch-3.0?

cloud-fan pushed a commit that referenced this pull request Aug 17, 2020
### What changes were proposed in this pull request?

Revert the SPARK-32018-related changes in branch 3.0: #29125 and #29404

### Why are the changes needed?

#29404 was made to fix the correctness regression introduced by #29125. However, the resulting behavior of decimal overflow in non-ANSI mode is strange:
1. from 3.0.0 to 3.0.1: decimal overflow throws an exception instead of returning null;
2. from 3.0.1 to 3.1.0: decimal overflow returns null instead of throwing an exception.

So, this PR proposes to revert both #29404 and #29125, so that Spark returns null on decimal overflow in both Spark 3.0.0 and Spark 3.0.1.

### Does this PR introduce _any_ user-facing change?

Yes, Spark will return null on decimal overflow in Spark 3.0.1.

### How was this patch tested?

Unit tests

Closes #29450 from gengliangwang/revertDecimalOverflow.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@HyukjinKwon
Member

@cloud-fan, I will port it back to other branches. I think it's doable.
