[SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0 #38760


Closed · wants to merge 1 commit into master from ulysses-you:SPARK-41219

Conversation

@ulysses-you (Contributor) commented Nov 22, 2022

### What changes were proposed in this pull request?

0 is a special case for decimals, since its data type can be Decimal(0, 0); to be safe, we should use decimal(1, 0) to represent 0.

### Why are the changes needed?

Fix a data correctness regression.

We no longer promote the decimal precision since the decimal binary operator refactor in #36698. As a result, the intermediate decimal type of `IntegralDivide` can be decimal(0, 0). This is dangerous because Spark does not actually support decimal(0, 0), e.g.:

```sql
-- works with the in-memory catalog
create table t (c decimal(0, 0)) using parquet;
-- fails with parquet:
-- java.lang.IllegalArgumentException: Invalid DECIMAL precision: 0
--   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:57)
insert into table t values(0);

-- fails with the hive catalog:
-- Caused by: java.lang.IllegalArgumentException: Decimal precision out of allowed range [1,38]
--   at org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils.validateParameter(HiveDecimalUtils.java:44)
create table t (c decimal(0, 0)) using parquet;
```

And decimal(0, 0) implies the value can only be 0, so to be safe we use decimal(1, 0) to represent 0 for `IntegralDivide`.
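As a minimal reproduction sketch (assuming a spark-shell session on a build with this regression; adapted from the test added in this PR):

```scala
// Minimal repro sketch, adapted from the regression test in this PR.
// With cast(a as decimal(7,7)) div 100, the intermediate integral-digit
// count is p1 - s1 + s2 = 7 - 7 + 0 = 0, i.e. decimal(0, 0), and
// Decimal.changePrecision(0, 0) returned false before this fix.
import spark.implicits._  // assumes a spark-shell / SparkSession named `spark`

val df = Seq("0.5944910").toDF("a")
df.selectExpr("cast(a as decimal(7,7)) div 100").show()  // expected result: 0
```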

### Does this PR introduce any user-facing change?

Yes, it is a bug fix.

### How was this patch tested?

Added tests.

@github-actions bot added the SQL label Nov 22, 2022
```scala
assert(Decimal(0).changePrecision(0, 0))
assert(Decimal(0L).changePrecision(0, 0))
assert(Decimal(java.math.BigInteger.valueOf(0)).changePrecision(0, 0))
assert(Decimal(BigDecimal(0)).changePrecision(0, 0))
```
Contributor Author:

this is the key test; before this fix, changePrecision returned false here

@ulysses-you (Contributor Author):

test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") {
val df = Seq("0.5944910").toDF("a")
checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0))
checkAnswer(df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))"), Row(0))
Contributor:

hmmm, decimal(0, 0) is a valid decimal type? how do you use it in production?

Contributor:

The numeric equivalent of CHAR(0). The only values are NULL and 0. I would choose the path of least resistance. I agree that there is risk in supporting it. What's the origin of this PR?

Contributor:

Now `cast(a as decimal(7,7)) div 100` fails and we want to fix it.

Member:

Shall we change the evaluation of decimal div integer instead?

Contributor Author:

@gengliangwang we can; it actually is the path of least resistance, by making sure the result decimal precision is bigger than 0. See the comment #38760 (comment).

Besides, some other queries would fail with decimal(0, 0), though that can rarely happen in production. So this PR wants to settle how Spark handles decimal(0, 0), or leave it as is.


test("SPARK-41219: Decimal changePrecision should work with decimal(0, 0)") {
val df = Seq("0.5944910").toDF("a")
checkAnswer(df.selectExpr("cast(a as decimal(7,7)) div 100"), Row(0))
Contributor Author:

@cloud-fan see this test: the decimal type of IntegralDivide can be decimal(0, 0) (other BinaryArithmetic operators do not have this issue).

```scala
override def resultDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  // This follows division rule
  val intDig = p1 - s1 + s2
  // No precision loss can happen as the result scale is 0.
  DecimalType.bounded(intDig, 0)
}
```

Or, we may change it to `max(p1 - s1 + s2, 1)`.
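A sketch of that variant (illustrative only; not necessarily the exact change that was merged):

```scala
// Illustrative variant: clamp the integral-digit count to at least 1 so
// the intermediate type can never be decimal(0, 0).
override def resultDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
  // Follows the division rule; e.g. for decimal(7, 7) div decimal(3, 0),
  // p1 - s1 + s2 would be 0 without the clamp.
  val intDig = math.max(p1 - s1 + s2, 1)
  // No precision loss can happen as the result scale is 0.
  DecimalType.bounded(intDig, 0)
}
```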

@cloud-fan (Contributor):

It seems reasonable to say that 0 is the only valid value for decimal(0, 0). Forbidding decimal(0, 0) also seems reasonable, but is riskier.

@cloud-fan (Contributor):

cc @srielau @viirya

@viirya (Member) commented Nov 23, 2022

Hmm, it seems precision 0 is not allowed everywhere? For example, a quick search found that MySQL disallows it, although I don't see others like PostgreSQL or Trino explicitly define it in their documentation.

```
@@ -420,7 +420,11 @@ final class Decimal extends Ordered[Decimal] with Serializable {
   // have overflowed our Long; in either case we must rescale dv to the new scale.
   dv = dv.setScale(scale, roundMode)
   if (dv.precision > precision) {
```
Contributor:

what is the precision of BigDecimal in this case?

Contributor Author:

it's 1, see the docs for java.math.BigDecimal:
[image: excerpt from the java.math.BigDecimal javadoc]
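This is easy to confirm directly (per the java.math.BigDecimal javadoc, the precision of a zero value is 1):

```scala
// java.math.BigDecimal#precision documents that a zero value has
// precision 1, so the check `dv.precision > precision` (1 > 0) fired
// and changePrecision(0, 0) returned false before this fix.
println(new java.math.BigDecimal("0").precision)    // 1
println(new java.math.BigDecimal("0.00").precision) // also 1
```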

Contributor:

I think this makes sense. If we write `SELECT 0bd` in Spark, the returned decimal is also decimal(1, 0). Maybe forbidding decimal(0, 0) is a better choice. What do you think, @srielau?
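A quick illustrative check with Spark's built-in typeof function (assuming Spark 3.0+):

```scala
// typeof returns the SQL type of its argument; a literal 0bd decimal
// should report decimal(1,0).
spark.sql("SELECT typeof(0bd)").show()  // expected: decimal(1,0)
```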

@ulysses-you ulysses-you changed the title [SPARK-41219][SQL] Decimal changePrecision should work with decimal(0, 0) [SPARK-41219][SQL] IntegralDivide use decimal(1, 0) to represent 0 Jan 30, 2023
@cloud-fan (Contributor):

The fix LGTM, thanks @ulysses-you! Do you know when we started to have this bug? And did we ever support decimal(0, 0), e.g. in CREATE TABLE and CAST?

@ulysses-you (Contributor Author):

> Do you know when we started to have this bug?

It happened in branch-3.4 after we refactored the decimal binary operator.

> And did we ever support decimal(0, 0), e.g. in CREATE TABLE and CAST?

It's more complex and is a long-standing issue. In short, Spark does not validate and fail when the precision is 0 in CREATE TABLE or a cast expression, but the dependencies (Hive/Parquet) do:

```sql
-- works with the in-memory catalog
create table t (c decimal(0, 0)) using parquet;
-- fails with parquet:
-- java.lang.IllegalArgumentException: Invalid DECIMAL precision: 0
--   at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:57)
insert into table t values(0);

-- fails with the hive catalog:
-- Caused by: java.lang.IllegalArgumentException: Decimal precision out of allowed range [1,38]
--   at org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils.validateParameter(HiveDecimalUtils.java:44)
create table t (c decimal(0, 0)) using parquet;
```

So I think we should fail if the precision is 0.

@cloud-fan (Contributor):

> It happened in branch-3.4 after we refactored the decimal binary operator.

Before the refactor, what was the return type of the integral divide? Also decimal(1, 0)?

@ulysses-you (Contributor Author):

The decimal type is an intermediate data type of IntegralDivide; the final data type is long. Before, we promoted the precision for the intermediate data type, so it was not decimal(1, 0).

@cloud-fan (Contributor):

Thanks for the explanation! @ulysses-you can you put it in the PR description? e.g. how this bug was introduced and why it worked before.

@ulysses-you (Contributor Author):

@cloud-fan sure, I have updated the description!

@cloud-fan (Contributor):

Please link to the decimal refactor PR. Then I think this is good to go.

@ulysses-you (Contributor Author):

@cloud-fan updated

@viirya (Member) left a comment:

Do we need to add an assert in DecimalType to prevent zero precision?

@cloud-fan (Contributor):

> Do we need to add an assert in DecimalType to prevent zero precision?

I think we should, but let's do it in the master branch only to officially ban 0 decimal precision.
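A hypothetical sketch of such a guard, on a simplified stand-in for Spark's DecimalType (illustrative only; the real follow-up on master may look different):

```scala
// Hypothetical zero-precision guard on a simplified stand-in type;
// Spark's actual DecimalType has more rules (e.g. negative-scale support).
case class DecimalType(precision: Int, scale: Int) {
  require(precision >= 1 && precision <= 38,
    s"Decimal precision ($precision) must be between 1 and 38")
  require(scale <= precision,
    s"Decimal scale ($scale) cannot be greater than precision ($precision)")
}

DecimalType(1, 0)     // ok
// DecimalType(0, 0)  // would now throw IllegalArgumentException
```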

@ulysses-you (Contributor Author):

@viirya I think we do, but it's a breaking change, so how about creating a new PR for master?

@viirya (Member) commented Jan 31, 2023

> I think we do, but it's a breaking change, so how about creating a new PR for master?

Yeah, let's do it in the master branch. Thanks.

@cloud-fan (Contributor):

thanks, merging to master/3.4!

@cloud-fan closed this in a056f69 on Jan 31, 2023
cloud-fan pushed a commit that referenced this pull request Jan 31, 2023

Closes #38760 from ulysses-you/SPARK-41219.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a056f69)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@ulysses-you deleted the SPARK-41219 branch January 31, 2023 04:23
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023