
Conversation

@andygrove (Member) commented Jan 22, 2025

Which issue does this PR close?

Closes #1294 (Result mismatch with vanilla Spark in hash function with decimal input)

Rationale for this change

Fixes a correctness issue: Comet's native Murmur3Hash and XXHash64 produce results that differ from Spark's when hashing decimals with precision > 18.

What changes are included in this PR?

Fall back to Spark for Murmur3Hash and XXHash64 when hashing decimals with precision > 18.
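
For background, Spark hashes decimals along two different paths depending on precision. The sketch below is a simplification of Spark's HashExpression semantics, not the Comet code: up to 18 digits the unscaled value fits in a Long and that long is hashed, while wider decimals hash the two's-complement bytes of an arbitrary-precision unscaled value, which is the case the native implementation did not reproduce.

    // Simplified sketch of Spark's decimal hashing semantics (illustrative,
    // not the actual Spark or Comet source).
    import org.apache.spark.sql.types.Decimal

    def hashInput(d: Decimal, precision: Int): Either[Long, Array[Byte]] = {
      if (precision <= Decimal.MAX_LONG_DIGITS) { // MAX_LONG_DIGITS == 18
        // compact path: the decimal fits in a Long; the unscaled long is hashed
        Left(d.toUnscaledLong)
      } else {
        // wide path: hash the two's-complement bytes of the unscaled value
        Right(d.toJavaBigDecimal.unscaledValue().toByteArray)
      }
    }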

How are these changes tested?

@codecov-commenter commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 88.23529% with 4 lines in your changes missing coverage. Please review.

Project coverage is 39.13%. Comparing base (f09f8af) to head (0eef007).
Report is 11 commits behind head on main.

Files with missing lines                               Patch %   Lines
...k/src/main/scala/org/apache/comet/serde/hash.scala  87.50%    2 Missing and 2 partials ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##               main    #1325       +/-   ##
=============================================
- Coverage     56.12%   39.13%   -17.00%     
- Complexity      976     2065     +1089     
=============================================
  Files           119      262      +143     
  Lines         11743    60262    +48519     
  Branches       2251    12819    +10568     
=============================================
+ Hits           6591    23581    +16990     
- Misses         4012    32201    +28189     
- Partials       1140     4480     +3340     


@andygrove mentioned this pull request Jan 28, 2025
@andygrove marked this pull request as ready for review Jan 28, 2025
@andygrove self-assigned this Jan 28, 2025
@andygrove (Member Author)

@parthchandra @wForget fyi

@mbutrovich (Contributor)

Does this implicitly affect any data read that originated as uint64? I believe it gets converted to DECIMAL(20,0).

@andygrove (Member Author)

> Does this implicitly affect any data read that originated as uint64? I believe it gets converted to DECIMAL(20,0).

Yes, in the context of a user calling the hash or xxhash64 Spark SQL functions on that data. This PR does not change anything with respect to hashing as part of shuffle.
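
For illustration only, a hypothetical spark-shell snippet (not from this PR) showing the affected entry points: explicit hash/xxhash64 calls over a DECIMAL(20,0) column, such as one produced from uint64 Parquet data.

    import org.apache.spark.sql.functions.{col, hash, xxhash64}

    // DECIMAL(20,0) exceeds precision 18, so explicit hash()/xxhash64()
    // calls over this column fall back to Spark after this PR.
    val df = spark.sql("SELECT CAST('12345678901234567890' AS DECIMAL(20, 0)) AS c")
    df.select(hash(col("c")), xxhash64(col("c"))).show()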

def isSupportedType(expr: Expression): Boolean = {
  for (child <- expr.children) {
    child.dataType match {
      // wide decimals are not supported natively; fall back to Spark
      case dt: DecimalType if dt.precision > 18 => return false
      case _ =>
    }
  }
  true
}
Contributor commented on this hunk:
From the issue, it seemed that we get a test failure even when the precision is less than 18. So do we want to fall back to Spark for all DecimalType values?

@andygrove (Member Author) replied:
There was an earlier PR, #1295, that implemented the correct behavior for the case where precision is <= 18.
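
As a hedged illustration of the resulting behavior, the harness below is hypothetical (Murmur3Hash and AttributeReference are Spark catalyst classes; isSupportedType is the guard shown in the hunk above):

    import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Murmur3Hash}
    import org.apache.spark.sql.types.DecimalType

    val narrow = AttributeReference("d1", DecimalType(18, 2))()
    val wide   = AttributeReference("d2", DecimalType(20, 0))()

    // precision <= 18: handled natively; precision > 18: fall back to Spark
    assert(isSupportedType(Murmur3Hash(Seq(narrow), seed = 42)))
    assert(!isSupportedType(Murmur3Hash(Seq(wide), seed = 42)))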

@andygrove (Member Author)

Thanks for the review @kazuyukitanimura and @parthchandra

@andygrove merged commit e964947 into apache:main Jan 29, 2025 (74 checks passed)
@andygrove deleted the hash-decimal branch Jan 29, 2025