fix: Fix SparkSha2 to be compliant with Spark response and add support for Int32 #16350

Merged: 5 commits into apache:main, Jun 13, 2025

Conversation

@rishvin (Contributor) commented on Jun 9, 2025

Which issue does this PR close?

Rationale for this change

Please see the linked issue (SparkSha2 is not compliant with Spark and does not support Int32 type).

What changes are included in this PR?

  • Added support for the Int32 type for the sha2 bit_length argument.
  • Refactored the code to remove duplication.
  • Fixed SparkSha2 to return its hex output in lowercase.
  • Fixed SparkSha2 to return NULL instead of throwing when the bit length is not supported (a simplified sketch of this behaviour follows this list).
  • Updated the test cases in hash/sha2.slt accordingly.
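
For illustration, the per-value behaviour described in the bullets above could look roughly like the following Rust sketch. This is a hypothetical, simplified example; the function name sha2_digest and the use of the sha2 crate are assumptions for illustration, not the actual DataFusion implementation.

// Hypothetical sketch only; not the actual SparkSha2 implementation.
// Assumes the `sha2` crate is available as a dependency.
use sha2::{Digest, Sha224, Sha256, Sha384, Sha512};

// Returns the lowercase hex digest, or None (mapped to SQL NULL) when the
// bit length is unsupported, matching Spark's sha2 behaviour. Spark treats
// a bit length of 0 as SHA-256.
fn sha2_digest(value: &[u8], bit_length: i32) -> Option<String> {
    let digest: Vec<u8> = match bit_length {
        0 | 256 => Sha256::digest(value).to_vec(),
        224 => Sha224::digest(value).to_vec(),
        384 => Sha384::digest(value).to_vec(),
        512 => Sha512::digest(value).to_vec(),
        _ => return None, // unsupported bit length -> NULL instead of an error
    };
    // Lowercase hex output, matching Spark.
    Some(digest.iter().map(|b| format!("{b:02x}")).collect())
}

fn main() {
    // Matches the spark-shell output shown later in this conversation.
    assert_eq!(
        sha2_digest(b"abc", 256).as_deref(),
        Some("ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad")
    );
    // Unsupported bit length returns NULL rather than raising an error.
    assert_eq!(sha2_digest(b"abc", 100), None);
}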

Are these changes tested?

Tested in the following ways:

Are there any user-facing changes?

Yes, the sha2 function can be used in SQL.

This change affects the SHA2 output in two ways:

  • The SHA2 hex output is now lowercase instead of uppercase.
  • SHA2 no longer throws an error for an unsupported bit length and instead returns NULL, in accordance with Spark behaviour.

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) and spark labels on Jun 9, 2025
@rishvin (Contributor, Author) commented on Jun 9, 2025

@andygrove here are the SHA2 changes to be compliant with Spark.

@andygrove (Member) commented:

@getChan fyi

@@ -192,7 +195,7 @@ pub fn spark_hex(args: &[ColumnarValue]) -> Result<ColumnarValue, DataFusionError>

     let hexed: StringArray = array
         .iter()
-        .map(|v| v.map(hex_bytes).transpose())
+        .map(|v| v.map(|b| hex_bytes(b, true)).transpose())
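
(For context: the diff above passes a second argument to hex_bytes to request lowercase output. Below is a hypothetical sketch of such a helper; the real DataFusion helper returns a Result, which is why the .transpose() call is needed, but the formatting idea is the same.)

// Hypothetical illustration of a hex helper with a lowercase flag; this is
// not the actual DataFusion hex_bytes.
fn hex_bytes_sketch(bytes: &[u8], lowercase: bool) -> String {
    bytes
        .iter()
        .map(|b| {
            if lowercase {
                format!("{b:02x}") // e.g. "ba7816bf..."
            } else {
                format!("{b:02X}") // e.g. "BA7816BF..."
            }
        })
        .collect()
}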
Member:

👍

@getChan (Contributor) left a comment:

Thanks for the contribution. However, the hex function should not be modified.

@@ -192,7 +195,7 @@ pub fn spark_hex(args: &[ColumnarValue]) -> Result<ColumnarValue, DataFusionError>

     let hexed: StringArray = array
         .iter()
-        .map(|v| v.map(hex_bytes).transpose())
+        .map(|v| v.map(|b| hex_bytes(b, true)).transpose())
Contributor:

It seems that Spark's hex function is expected to return uppercase characters.
https://spark.apache.org/docs/latest/api/sql/index.html#hex

Contributor Author (@rishvin):

Thanks @getChan for pointing this out. In that case, I will make the change for the SHA2 hex representation only. Here is an example of sha2 from spark-shell:

scala> spark.sql("SELECT sha2('abc', 256) AS sha256_value").show(false)
+----------------------------------------------------------------+
|sha256_value                                                    |
+----------------------------------------------------------------+
|ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad|
+----------------------------------------------------------------+

Member:

The change to the hex function did not cause any test failures ... does this mean we are missing some tests?

Contributor Author (@rishvin):

We do have hex-related test cases in Comet, but they seem to be passing. This appears to be because Comet has its own hex function, for which we call hex_bytes, but there lowercase is set to false, hence no failure.

So it looks like DataFusion's hex functionality is not being leveraged and Comet is handling this on its own, which is why we did not see a regression in the Comet test suite.

Member:

Sorry, I meant that we should have seen test failures in DataFusion after changing the hex behavior. It looks like we should add an slt test for hex as well.

Contributor Author (@rishvin):

We also have a test in DataFusion, datafusion/sqllogictest/test_files/spark/math/hex.slt, but my PR changes its expected output to lowercase. I will have to revert those changes.

Member:

Oh, sorry. I totally missed that when reviewing.

@rishvin requested review from andygrove and getChan on June 11, 2025 04:50
@rishvin (Contributor, Author) commented on Jun 11, 2025

Thanks @andygrove / @getChan for the feedback. Please review the changes.

@alamb (Contributor) commented on Jun 11, 2025

FYI @shehabgamin

@getChan (Contributor) left a comment:

Thanks for the refactoring and improvements.

@shehabgamin (Contributor) left a comment:

🚀

@alamb (Contributor) left a comment.

@alamb merged commit 4dd6923 into apache:main on Jun 13, 2025; 28 checks passed.
Labels: spark, sqllogictest (SQL Logic Tests (.slt))

Projects: None yet

Successfully merging this pull request may close this issue:

  • SparkSha2 is not compliant with Spark and does not support Int32 type

5 participants