[SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0 #38760
Conversation
```scala
assert(Decimal(0).changePrecision(0, 0))
assert(Decimal(0L).changePrecision(0, 0))
assert(Decimal(java.math.BigInteger.valueOf(0)).changePrecision(0, 0))
assert(Decimal(BigDecimal(0)).changePrecision(0, 0))
```
This is the key test; before this change it returned false.
test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") { | ||
val df = Seq("0.5944910").toDF("a") | ||
checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0)) | ||
checkAnswer(df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))"), Row(0)) |
hmmm, decimal(0, 0) is a valid decimal type? how do you use it in production?
The numeric equivalent to CHAR(0). The only values are NULL and 0. I would choose the path of least resistance. I agree that there is risk in support. What's the origin of this PR?
Now cast(a as decimal(7,7)) div 100 fails and we want to fix it.
Shall we change the evaluation of decimal div integer instead?
@gengliangwang we can, and it actually is the path of least resistance: make sure the result decimal precision is bigger than 0, see the comment #38760 (comment).
Besides, some other queries would fail with decimal(0, 0), though that rarely happens in production. So this PR wants to settle how Spark should handle decimal(0, 0), or leave it as is.
test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") { | ||
val df = Seq("0.5944910").toDF("a") | ||
checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0)) |
@cloud-fan see this test: the decimal type of IntegralDivide can be decimal(0, 0) (other BinaryArithmetic expressions do not have this issue).
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala (lines 864 to 868 at d275a83):
```scala
override def resultDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  // This follows division rule
  val intDig = p1 - s1 + s2
  // No precision loss can happen as the result scale is 0.
  DecimalType.bounded(intDig, 0)
```
Or, we may change it to max(p1 - s1 + s2, 1)
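For concreteness, here is a hedged sketch of what that suggestion could look like, based on the resultDecimalType excerpt above; it is illustrative and not necessarily the exact change that was merged:

```scala
// Sketch only: clamp the intermediate precision to at least 1 so a zero result
// is typed as decimal(1, 0) rather than decimal(0, 0).
override def resultDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  // This follows the division rule, but never lets the precision drop to 0.
  val intDig = math.max(p1 - s1 + s2, 1)
  // No precision loss can happen as the result scale is 0.
  DecimalType.bounded(intDig, 0)
}
```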
It seems reasonable to say that 0 is the only valid value for decimal(0, 0).
Hmm, it seems precision 0 is not allowed everywhere? For example, a quick search found that MySQL disallows it, though I don't see others like PostgreSQL or Trino explicitly defining it in their documentation.
```diff
@@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with Serializable {
       // have overflowed our Long; in either case we must rescale dv to the new scale.
       dv = dv.setScale(scale, roundMode)
       if (dv.precision > precision) {
```
what is the precision of BigDecimal in this case?
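For reference (not from the thread itself): java.math.BigDecimal documents the precision of a zero value as 1, which is why the old `dv.precision > precision` check rejected a target precision of 0. A quick standalone check:

```scala
// The precision of BigDecimal zero is 1, so with a target precision of 0 the
// `dv.precision > precision` branch fires and changePrecision(0, 0) used to
// return false.
val dv = new java.math.BigDecimal("0").setScale(0)
println(dv.precision)      // prints 1
println(dv.precision > 0)  // true, i.e. the overflow branch is taken for precision 0
```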
I think this makes sense. If we write SELECT 0bd in Spark, the returned decimal is also decimal(1, 0). Maybe forbidding decimal(0, 0) is a better choice. What do you think @srielau?
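A minimal way to check that claim in spark-shell (this assumes a running SparkSession named `spark` and is not part of this PR):

```scala
// The zero decimal literal 0BD should be typed as decimal(1, 0), not decimal(0, 0).
val dt = spark.sql("SELECT 0BD AS c").schema("c").dataType
println(dt)  // expected: DecimalType(1,0)
```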
The fix LGTM, thanks @ulysses-you! Do you know when we started to have this bug? And did we ever support decimal(0, 0)?
It happened in branch-3.4 after we refactored the decimal binary operators.

Whether we ever supported decimal(0, 0) is more complex; it is a long-standing issue. In short, Spark does not validate and fail if the precision is 0 when creating a table or casting an expression, but the dependencies (Hive/Parquet) do:

```sql
-- works with in-memory catalog
create table t (c decimal(0, 0)) using parquet;

-- fails with parquet
-- java.lang.IllegalArgumentException: Invalid DECIMAL precision: 0
--   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:57)
insert into table t values(0);

-- fails with hive catalog
-- Caused by: java.lang.IllegalArgumentException: Decimal precision out of allowed range [1,38]
--   at org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils.validateParameter(HiveDecimalUtils.java:44)
create table t (c decimal(0, 0)) using parquet;
```

So I think we should fail if precision is 0.
Before the refactor, what was the return type of the integral divide? Also decimal(0, 0)?
The decimal type is an intermediate data type of IntegralDivide.
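A quick way to see that the decimal is only intermediate (assumes a SparkSession named `spark`; illustrative, not from the PR):

```scala
// The user-facing result type of `div` is long; the decimal(0, 0) discussed
// above only appears as an internal intermediate type during evaluation.
val resType = spark.sql("SELECT CAST(0.5944910 AS DECIMAL(7,7)) div 100 AS r")
  .schema("r").dataType
println(resType)  // expected: LongType
```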
Thanks for the explanation! @ulysses-you can you put it in the PR description? e.g. how this bug was introduced and why it worked before.
@cloud-fan sure, I have updated the description!
Please link to the PR of the decimal refactor. Then I think this is good to go.
@cloud-fan updated
Do we need to add some assert in DecimalType to prevent zero precision?
I think we should, but let's do it in the master branch only, to officially ban 0 decimal precision.
@viirya I think we need to, but it's a breaking change, so how about creating a new PR for master?
Yea, let's do it in the master branch. Thanks.
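For illustration, a hypothetical shape of such a guard (names and bounds are illustrative, not Spark's API; the actual change was deferred to a separate PR against master):

```scala
// Hypothetical helper: reject a zero (or otherwise out-of-range) precision
// when a DecimalType is constructed.
def checkDecimalPrecision(precision: Int): Unit = {
  require(precision >= 1 && precision <= 38,
    s"Decimal precision $precision must be between 1 and 38")
}
```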
Thanks, merging to master/3.4!
### What changes were proposed in this pull request?

0 is a special case for decimal whose data type can be Decimal(0, 0); to be safe we should use decimal(1, 0) to represent 0.

### Why are the changes needed?

Fix a data correctness regression. We no longer promote the decimal precision since we refactored the decimal binary operators in #36698. However, this causes the intermediate decimal type of `IntegralDivide` to become decimal(0, 0). This is dangerous because Spark does not actually support decimal(0, 0), e.g.

```sql
-- works with in-memory catalog
create table t (c decimal(0, 0)) using parquet;

-- fails with parquet
-- java.lang.IllegalArgumentException: Invalid DECIMAL precision: 0
--   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:57)
insert into table t values(0);

-- fails with hive catalog
-- Caused by: java.lang.IllegalArgumentException: Decimal precision out of allowed range [1,38]
--   at org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils.validateParameter(HiveDecimalUtils.java:44)
create table t (c decimal(0, 0)) using parquet;
```

And decimal(0, 0) means the data is 0, so to be safe we use decimal(1, 0) to represent 0 for `IntegralDivide`.

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

Added a test.

Closes #38760 from ulysses-you/SPARK-41219.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a056f69)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>