[SPARK-28224][SQL] Check overflow in decimal Sum aggregate #25033
Conversation
The change makes sense to me. Could you please also check that SQL Server behaves in the same way, i.e. that it throws an exception in this specific case?
`Seq(true, false).foreach {`
This should use `resultType`, I think.
`dataType` actually calls `resultType`, but the latter reads better for our purpose here. Will update accordingly.
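For context, a simplified excerpt of the relationship in `Sum` being discussed (abridged, not the full definition):

```scala
// Sum widens decimal inputs by 10 digits of precision (capped at 38 by
// DecimalType.bounded); dataType simply delegates to resultType.
private lazy val resultType = child.dataType match {
  case DecimalType.Fixed(precision, scale) =>
    DecimalType.bounded(precision + 10, scale)
  case _: IntegralType => LongType
  case _ => DoubleType
}

override def dataType: DataType = resultType
```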
Yes, sorry, I meant something like this:

```scala
case _: DecimalType =>
  CheckOverflow(sum, resultType, SQLConf.get.decimalOperationsNullOnOverflow)
```

In this way we do not create a new `DecimalType`, and I think it is more readable... Thanks.
Updated to use a matched case, because `resultType` is still a generic `DataType` and not a `DecimalType`:

```scala
override lazy val evaluateExpression: Expression = resultType match {
  case d: DecimalType => CheckOverflow(sum, d, SQLConf.get.decimalOperationsNullOnOverflow)
  case _ => sum
}
```
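To illustrate what this wrapper changes at the user level, here is a sketch (assuming a Spark shell with a `spark` session; the column name and values are made up):

```scala
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

// Two values at the top of decimal(38, 0); their sum needs 39 digits,
// beyond the 38-digit cap of the sum's own result type.
val df = Seq("9" * 38, "9" * 38).toDF("s")
  .select($"s".cast(DecimalType(38, 0)).as("d"))

// With spark.sql.decimalOperations.nullOnOverflow=true (the default at the
// time), CheckOverflow turns the overflowed sum into null; when it is set
// to false, an ArithmeticException is thrown instead.
df.agg(sum($"d")).show()
```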
why do we need this?
This follows the convention for how the other datasets in this file are created.
Well, if others think it is OK to leave it, I am fine with it; honestly, I'd remove it since it is useless.
I'm fine either way and would defer to others, although I do see value in consistency here.
Yea, I personally think we don't need these changes in this PR either.
PS Please add …

Honestly, I don't consider this a followup to #25010. This was actually a bug leading to weird behaviors, such as …
ok to test
@mgaido91, thank you for the review! I made the updates. Aiming for a least-surprise behavior, do you think …

A larger question here is perhaps whether there are other operations on fixed precision that could result in overflow. And supposing we are okay with this PR's fix, does it mean we should patch all such operators?

A non-issue here is …

[0] http://sqlfiddle.com/#!9/676c31/1
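For what it's worth, one reading (an assumption worth double-checking, not something this PR changes) is that binary decimal arithmetic is already guarded by `CheckOverflow` via the `DecimalPrecision` rules, and the aggregate path was the one missing the check. A sketch:

```scala
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.DecimalType

// Squaring a 20-digit decimal needs ~40 digits, past the 38-digit cap;
// the multiply is already wrapped in CheckOverflow, so it nulls out
// (or errors) rather than producing a wrong value.
val d = lit("9" * 20).cast(DecimalType(38, 0))
spark.range(1).select((d * d).as("product")).show()
```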
Test build #107252 has finished for PR 25033 at commit …
Thanks for your work. Regarding SQL Server, I also checked its behaviour. The main difference from our current behaviour is that it always returns the max precision, instead of our formula `10 + precision`. So I submitted a PR for that. LGTM
Test build #107258 has finished for PR 25033 at commit …
jenkins retest this please
Force-pushed from b668d6f to 4094335.
jenkins retest this please
Test build #107419 has finished for PR 25033 at commit …
Hi @dongjoon-hyun, @cloud-fan. Just wanted to follow up on this PR, please.
Review thread on sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala (outdated; resolved).
```scala
  }

  test("Aggregate sum integers") {
```
Do we need the two tests below for this change?
@maropu the change introduces branching for the BigDecimal case, so I want to verify that sum still works for the integer/double cases.
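Concretely, a minimal sketch of the kind of guard meant here (the test name is from this diff; the body is assumed and uses the suite's usual `checkAnswer` helper):

```scala
// The non-decimal branch of evaluateExpression should still return the
// raw sum (a Long here), with no overflow wrapper in the way.
test("Aggregate sum integers") {
  checkAnswer(spark.range(1, 4).agg(sum($"id")), Row(6L))
}
```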
But the int/double case is an existing path in master, so we already have tests for that somewhere, right? I personally think the test above is enough for this PR.
Test build #109280 has finished for PR 25033 at commit …
LGTM except for one minor comment.
Test build #109312 has finished for PR 25033 at commit …
```scala
    df
  }

  protected lazy val largeDecimals: DataFrame = {
```
`SQLTestData` should only contain data that is widely used by many tests. Let's move `largeDecimals` to its test case.
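A sketch of what that move could look like (the values and expectation are placeholders, not the actual fixture from this PR):

```scala
test("decimal sum overflow (sketch)") {
  // fixture declared inside the test instead of in SQLTestData
  val largeDecimals = Seq(
    BigDecimal("12345678901234567890.0"),
    BigDecimal("98765432109876543210.0")
  ).toDF("d")
  // both inputs fit decimal(38, 18), but their sum needs 21 integer
  // digits; with nullOnOverflow=true the aggregate evaluates to null
  checkAnswer(largeDecimals.agg(sum($"d")), Row(null))
}
```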
Test build #109353 has finished for PR 25033 at commit …
Force-pushed from 49c97d8 to db2d9a5.
Test build #109356 has finished for PR 25033 at commit …
Thanks! Merged to master.
What changes were proposed in this pull request?
Currently `sum` in aggregates for decimal type can overflow and return null. The `Sum` expression codegens arithmetic on `sql.Decimal`, and the output, which preserves scale and precision, goes into `UnsafeRowWriter`; there the overflowed value is converted to null when writing out.

The aggregate also does not get rewritten by `DecimalAggregates`, because that rule expects the precision of the sum (not of the elements to be summed) to be less than 5:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
Lines 1400 to 1403 in 4ebff5b
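To make the failure mode concrete, here is a minimal reproduction sketch (the values are made up; the silent null is the pre-patch behavior described above):

```scala
// Each input fits decimal(38, 0), but the aggregated sum needs 39 digits.
// Before this patch the overflow was silently written out as null.
spark.sql("""
  SELECT sum(CAST(v AS DECIMAL(38, 0))) AS s
  FROM VALUES ('99999999999999999999999999999999999999'),
              ('99999999999999999999999999999999999999') AS t(v)
""").show()
```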
This PR adds the check at the final result of the sum operator itself.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala
Lines 372 to 376 in 4ebff5b
https://issues.apache.org/jira/browse/SPARK-28224
How was this patch tested?
cc @mgaido91 @JoshRosen