[SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0 #38760
Conversation
```scala
assert(Decimal(0).changePrecision(0, 0))
assert(Decimal(0L).changePrecision(0, 0))
assert(Decimal(java.math.BigInteger.valueOf(0)).changePrecision(0, 0))
assert(Decimal(BigDecimal(0)).changePrecision(0, 0))
```
This is the key test; before this change it returned false.
test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") { | ||
val df = Seq("0.5944910").toDF("a") | ||
checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0)) | ||
checkAnswer(df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))"), Row(0)) |
hmmm, decimal(0, 0) is a valid decimal type? how do you use it in production?
The numeric equivalent to CHAR(0). The only values are NULL and 0. I would choose the path of least resistance. I agree that there is risk in support. What's the origin of this PR?
Now cast(a as decimal(7,7)) div 100 fails and we want to fix it.
Shall we change the evaluation of decimal div integer instead?
@gengliangwang we can, and it actually is the path of least resistance: make sure the result decimal precision is bigger than 0, see the comment #38760 (comment).
Besides, some other queries would fail with decimal(0, 0), though that rarely happens in production. So this PR wants to settle how Spark should handle decimal(0, 0), or leave it as is.
test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") { | ||
val df = Seq("0.5944910").toDF("a") | ||
checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0)) |
@cloud-fan see this test: the decimal type of IntegralDivide can be decimal(0, 0) (other BinaryArithmetic expressions do not have this issue).
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/arithmetic.scala (lines 864 to 868 at d275a83):
```scala
override def resultDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  // This follows division rule
  val intDig = p1 - s1 + s2
  // No precision loss can happen as the result scale is 0.
  DecimalType.bounded(intDig, 0)
```
Or, we may change it to max(p1 - s1 + s2, 1)
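For concreteness, here is a hedged sketch of what that suggestion could look like, based on the resultDecimalType excerpt above; it is illustrative and not necessarily the exact change that was merged:

```scala
// Sketch only: clamp the intermediate precision to at least 1 so a zero result
// is typed as decimal(1, 0) rather than decimal(0, 0).
override def resultDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  // This follows the division rule, but never lets the precision drop to 0.
  val intDig = math.max(p1 - s1 + s2, 1)
  // No precision loss can happen as the result scale is 0.
  DecimalType.bounded(intDig, 0)
}
```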
It seems reasonable to say that 0 is the only valid value for decimal(0, 0).
Hmm, it seems precision 0 is not allowed everywhere? For example, a quick search found that MySQL disallows it, though I don't see others like PostgreSQL or Trino explicitly defining it in their documentation.
```diff
@@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with Serializable {
       // have overflowed our Long; in either case we must rescale dv to the new scale.
       dv = dv.setScale(scale, roundMode)
       if (dv.precision > precision) {
```
what is the precision of BigDecimal in this case?
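For reference (not from the thread itself): java.math.BigDecimal documents the precision of a zero value as 1, which is why the old `dv.precision > precision` check rejected a target precision of 0. A quick standalone check:

```scala
// The precision of BigDecimal zero is 1, so with a target precision of 0 the
// `dv.precision > precision` branch fires and changePrecision(0, 0) used to
// return false.
val dv = new java.math.BigDecimal("0").setScale(0)
println(dv.precision)      // prints 1
println(dv.precision > 0)  // true, i.e. the overflow branch is taken for precision 0
```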
I think this makes sense. If we write SELECT 0bd in Spark, the returned decimal is also decimal(1, 0). Maybe forbidding decimal(0, 0) is a better choice. What do you think @srielau?
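A minimal way to check that claim in spark-shell (this assumes a running SparkSession named `spark` and is not part of this PR):

```scala
// The zero decimal literal 0BD should be typed as decimal(1, 0), not decimal(0, 0).
val dt = spark.sql("SELECT 0BD AS c").schema("c").dataType
println(dt)  // expected: DecimalType(1,0)
```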
The fix LGTM, thanks @ulysses-you! Do you know when we started to have this bug? And did we ever support decimal(0, 0)?
It happened in branch-3.4 after we refactored the decimal binary operators.

Whether we ever supported decimal(0, 0) is more complex; it is a long-standing issue. In short, Spark does not validate and fail if the precision is 0 when creating a table or casting an expression, but the dependencies (Hive/Parquet) do:

```sql
-- works with in-memory catalog
create table t (c decimal(0, 0)) using parquet;

-- fails with parquet
-- java.lang.IllegalArgumentException: Invalid DECIMAL precision: 0
--   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:57)
insert into table t values(0);

-- fails with hive catalog
-- Caused by: java.lang.IllegalArgumentException: Decimal precision out of allowed range [1,38]
--   at org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils.validateParameter(HiveDecimalUtils.java:44)
create table t (c decimal(0, 0)) using parquet;
```

So I think we should fail if precision is 0.
Before the refactor, what was the return type of the integral divide? Also decimal(0, 0)?
The decimal type is an intermediate data type of IntegralDivide.
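A quick way to see that the decimal is only intermediate (assumes a SparkSession named `spark`; illustrative, not from the PR):

```scala
// The user-facing result type of `div` is long; the decimal(0, 0) discussed
// above only appears as an internal intermediate type during evaluation.
val resType = spark.sql("SELECT CAST(0.5944910 AS DECIMAL(7,7)) div 100 AS r")
  .schema("r").dataType
println(resType)  // expected: LongType
```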
Thanks for the explanation! @ulysses-you can you put it in the PR description? e.g. how this bug was introduced and why it worked before.
@cloud-fan sure, I have updated the description!
Please link to the PR of the decimal refactor. Then I think this is good to go.
@cloud-fan updated
Do we need to add some assert in DecimalType to prevent zero precision?
I think we should, but let's do it in the master branch only, to officially ban 0 decimal precision.
@viirya I think we need to, but it's a breaking change, so how about creating a new PR for master?
Yea, let's do it in the master branch. Thanks.
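For illustration, a hypothetical shape of such a guard (names and bounds are illustrative, not Spark's API; the actual change was deferred to a separate PR against master):

```scala
// Hypothetical helper: reject a zero (or otherwise out-of-range) precision
// when a DecimalType is constructed.
def checkDecimalPrecision(precision: Int): Unit = {
  require(precision >= 1 && precision <= 38,
    s"Decimal precision $precision must be between 1 and 38")
}
```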
Thanks, merging to master/3.4!
### What changes were proposed in this pull request?

0 is a special case for decimal whose data type can be Decimal(0, 0); to be safe we should use decimal(1, 0) to represent 0.

### Why are the changes needed?

Fix a data correctness regression. We no longer promote the decimal precision since we refactored the decimal binary operators in #36698. However, this causes the intermediate decimal type of `IntegralDivide` to become decimal(0, 0). This is dangerous because Spark does not actually support decimal(0, 0), e.g.

```sql
-- works with in-memory catalog
create table t (c decimal(0, 0)) using parquet;

-- fails with parquet
-- java.lang.IllegalArgumentException: Invalid DECIMAL precision: 0
--   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:57)
insert into table t values(0);

-- fails with hive catalog
-- Caused by: java.lang.IllegalArgumentException: Decimal precision out of allowed range [1,38]
--   at org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils.validateParameter(HiveDecimalUtils.java:44)
create table t (c decimal(0, 0)) using parquet;
```

And decimal(0, 0) means the data is 0, so to be safe we use decimal(1, 0) to represent 0 for `IntegralDivide`.

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

Added a test.

Closes #38760 from ulysses-you/SPARK-41219.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a056f69)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>