-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-47307][SQL][3.5] Add a config to optionally chunk base64 strings #47325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Follow up apache#45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In apache#35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 8d3d4f9)
can you check these test failures? |
.doc("Whether to truncate string generated by the `Base64` function. When true, base64" + | ||
" strings generated by the base64 function are chunked into lines of at most 76" + | ||
" characters. When false, the base64 strings are not chunked.") | ||
.version("3.5.2") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, if we start to backport SPARK-47307, it will go to Apache Spark 3.4.4 together, right? In that case, I'm curious if 3.5.2 is correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. Got it. I saw this comment from Apache Spark 3.5.2 release manager, @yaooqinn .
If then, are we going to update these values from master
and branch-3.5
when we do the release of Apache Spark 3.4.4? I'm fine if we are going to do in that way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @dongjoon-hyun.
Not related to this PR, maybe we shall add multiple fixed version in this field, such as 3.4.4, 3.5.2, 4.0.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Theoretically, it's possible, but it will enforce us to update all the existing configurations and documentations. So, we had better not to because it could be too much.
cc @cloud-fan , too |
Looks like it's missing fb5697d#diff-fcaa047af256f3afb055c3e5d466d5a0fe94851bd6cc8f96ac04673d52e1a321 Should we backport #47017 or introduce this change in this PR? |
However, SPARK-48658 was merged as an improvement JIRA, @yaooqinn . Do you mean we need to convert it as a bug fix? |
If we need to change the issue type, please comment on your initial PR to get a consensus. |
Okay, based on the information provided by @dongjoon-hyun and the Policy of backporting bugfiexes, I think we shall only fix the test errors instead of backporting SPARK-48658 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from my side. Let's wait a bit while to see if @dongjoon-hyun has any concerns.
@wForget Can you raise another PR to 'master' to add a migration guide for migrating 3.5.1 to 3.5.2?
Sure, I will do it later. |
Backports #47303 to 3.5 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In #35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47325 from wForget/SPARK-47307_3.5. Lead-authored-by: wforget <643348094@qq.com> Co-authored-by: Ted Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>
Merged to branch 3.5 for 3.5.2, thanks @wForget |
FYI this changed the nullability of base64 to always be nullable instead of depending on the input, this broke some of our stateful streams with a schema mismatch error when upgrading from 3.5.1 to 3.5.2. |
@Kimahriman thanks for reporting! I'm fixing it at #47952 |
Backports #47303 to 3.5
What changes were proposed in this pull request?
[SPARK-47307] Add a config to optionally chunk base64 strings
Why are the changes needed?
In #35110, it was incorrectly asserted that:
This is not true as the previous code called:
Which states:
However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing test suite.
Was this patch authored or co-authored using generative AI tooling?
No