[SPARK-47307][SQL][3.5] Add a config to optionally chunk base64 strings #47325

wForget · 2024-07-12T10:49:44Z

Backports #47303 to 3.5

What changes were proposed in this pull request?

[SPARK-47307] Add a config to optionally chunk base64 strings

Why are the changes needed?

In #35110, it was incorrectly asserted that:

ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

public static byte[] encodeBase64(byte[] binaryData)

Which states:

Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by decoders that do not implement RFC 2045. The docs state RFC 4648.
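
To make the difference concrete, here is a minimal sketch using the JDK's own `java.util.Base64` rather than the Apache Commons Codec class discussed above. The class name `Base64ChunkingDemo`, its helper methods, and the 58-byte input are illustrative assumptions and are not taken from the Spark code; the input length is chosen only because it encodes to more than 76 characters.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class: contrasts unchunked (RFC 4648) and MIME (RFC 2045) output.
final class Base64ChunkingDemo {
    // RFC 4648 "basic" encoding: a single unbroken line.
    static String encodeUnchunked(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // RFC 2045 (MIME) encoding: the JDK inserts CRLF after every 76 output characters.
    static String encodeMime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    // Returns true if the JDK's strict basic decoder accepts the text.
    static boolean strictDecoderAccepts(String text) {
        try {
            Base64.getDecoder().decode(text);
            return true;
        } catch (IllegalArgumentException e) {
            // Line separators are outside the basic base64 alphabet.
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[58];
        Arrays.fill(data, (byte) 'a');

        String plain = encodeUnchunked(data);
        String mime = encodeMime(data);

        System.out.println("unchunked length: " + plain.length());
        System.out.println("MIME output contains CRLF: " + mime.contains("\r\n"));
        System.out.println("strict decoder accepts unchunked: " + strictDecoderAccepts(plain));
        System.out.println("strict decoder accepts MIME: " + strictDecoderAccepts(mime));
    }
}
```

In this sketch the strict decoder rejects the chunked text, while `Base64.getMimeDecoder()` would accept both forms because it skips line separators.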

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suite.

Was this patch authored or co-authored using generative AI tooling?

No

Follow up apache#45408

### What changes were proposed in this pull request?

[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?

In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by decoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 8d3d4f9)

yaooqinn · 2024-07-12T16:34:59Z

[info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: base64(0x61616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161)
[info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
[info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:96)

can you check these test failures?

dongjoon-hyun · 2024-07-12T18:09:41Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+    .doc("Whether to truncate string generated by the `Base64` function. When true, base64" +
+      " strings generated by the base64 function are chunked into lines of at most 76" +
+      " characters. When false, the base64 strings are not chunked.")
+    .version("3.5.2")
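
The behavior described in this doc string can be sketched with the JDK's `java.util.Base64`. This is an illustration only, not the Spark implementation: the class `ConfigurableBase64` and the `chunk` parameter are hypothetical stand-ins for the boolean config, and the CRLF line separator is what the JDK's MIME encoder emits.

```java
import java.util.Base64;

// Hypothetical helper: `chunk` plays the role of the boolean config documented above.
final class ConfigurableBase64 {
    static String encode(byte[] input, boolean chunk) {
        Base64.Encoder encoder = chunk
            ? Base64.getMimeEncoder()  // lines of at most 76 characters, separated by CRLF
            : Base64.getEncoder();     // one unbroken line
        return encoder.encodeToString(input);
    }

    public static void main(String[] args) {
        byte[] input = new byte[100]; // 100 zero bytes encode to 136 base64 characters

        // Chunked: print each line with its length.
        for (String line : encode(input, true).split("\r\n")) {
            System.out.println(line.length() + " " + line);
        }

        // Unchunked: a single string.
        System.out.println(encode(input, false).length());
    }
}
```

Joining the chunked lines back together yields exactly the unchunked string, so the flag only affects line breaks, not the encoded payload.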


IIUC, if we start to backport SPARK-47307, it will also go into Apache Spark 3.4.4, right? In that case, I wonder whether 3.5.2 is correct.

Hmm. Got it. I saw this comment from the Apache Spark 3.5.2 release manager, @yaooqinn.

https://github.com/apache/spark/pull/47303/files#r1675645743

If so, are we going to update these values in master and branch-3.5 when we release Apache Spark 3.4.4? I'm fine with doing it that way.

Thank you @dongjoon-hyun.

Not related to this PR, but maybe we should list multiple fixed versions in this field, such as 3.4.4, 3.5.2, and 4.0.0.

Theoretically, it's possible, but it would force us to update all the existing configurations and documentation. So we had better not, because it could be too much.

dongjoon-hyun · 2024-07-12T18:13:36Z

cc @cloud-fan , too

wForget · 2024-07-13T02:01:03Z

> [info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: base64(0x61616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161)
> [info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
> [info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>
> can you check these test failures?

Looks like it's missing fb5697d#diff-fcaa047af256f3afb055c3e5d466d5a0fe94851bd6cc8f96ac04673d52e1a321

Should we backport #47017 or introduce this change in this PR?

yaooqinn · 2024-07-13T02:28:05Z

> Should we backport #47017

+1 @wForget

dongjoon-hyun · 2024-07-14T19:46:08Z

However, SPARK-48658 was merged as an improvement JIRA, @yaooqinn. Do you mean we need to convert it to a bug fix?

dongjoon-hyun · 2024-07-14T19:47:17Z

If we need to change the issue type, please comment on your initial PR to get a consensus.

[SPARK-48658][SQL] Encode/Decode functions report coding errors instead of mojibake for unmappable characters #47017 .

yaooqinn · 2024-07-15T02:38:30Z

Okay, based on the information provided by @dongjoon-hyun and the policy on backporting bug fixes, I think we should only fix the test errors instead of backporting SPARK-48658.

yaooqinn

LGTM from my side. Let's wait a little while to see if @dongjoon-hyun has any concerns.

@wForget Can you raise another PR to 'master' to add a migration guide for migrating from 3.5.1 to 3.5.2?

wForget · 2024-07-15T06:22:45Z

> @wForget Can you raise another PR to 'master' to add a migration guide for migrating from 3.5.1 to 3.5.2?

Sure, I will do it later.

Backports #47303 to 3.5

### What changes were proposed in this pull request?

[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?

In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by decoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #47325 from wForget/SPARK-47307_3.5.

Lead-authored-by: wforget <643348094@qq.com>
Co-authored-by: Ted Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>

yaooqinn · 2024-07-16T02:36:01Z

Merged to branch 3.5 for 3.5.2, thanks @wForget

dongjoon-hyun · 2024-07-16T07:31:04Z

Thank you, @wForget and @yaooqinn !

Kimahriman · 2024-08-30T13:55:44Z

FYI, this changed the nullability of base64 to always be nullable instead of depending on the input. This broke some of our stateful streams with a schema mismatch error when upgrading from 3.5.1 to 3.5.2.

cloud-fan · 2024-09-02T06:51:17Z

@Kimahriman thanks for reporting! I'm fixing it at #47952

github-actions bot added the SQL label Jul 12, 2024

yaooqinn approved these changes Jul 12, 2024

regenerate golden files

0662d66

github-actions bot added the CONNECT label Jul 12, 2024

dongjoon-hyun reviewed Jul 12, 2024

fix

f1da1fb

yaooqinn approved these changes Jul 15, 2024

yaooqinn mentioned this pull request Jul 15, 2024

[SPARK-47172][DOCS][FOLLOWUP] Fix spark.network.crypto.cipher since version field on security page #47353

Closed

yaooqinn closed this Jul 16, 2024

cloud-fan mentioned this pull request Sep 2, 2024

[SPARK-47307][SQL][FOLLOWUP] Fix Base64#nullable #47952

Closed
