-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-37820][SQL] Replace ApacheCommonBase64 with JavaBase64 for string funcs #35110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @cloud-fan @dongjoon-hyun @HyukjinKwon, thanks very much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, @yaooqinn . I have two requests.
- Could you regenerate the benchmark result by using GitHub Action?
- Could you add Java 17 benchmark result?
For the others, it looks good.
Thanks, @dongjoon-hyun any instructions for triggering this step? |
Yes, Here it is, |
I have found the guide, thanks |
Extraordinary feature! @dongjoon-hyun @HyukjinKwon
|
Are the benchmark result ready, @yaooqinn ? :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM. Merged to master.
Thank you, @yaooqinn , @cloud-fan , @HyukjinKwon .
…ing funcs ### What changes were proposed in this pull request? Replace Base64 on ApacheCommonBase64 with native support (https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html) for Base-64 encode/decode. ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt with url-safe off according to its doc, so we choose `java.util.Base64.Decoder#RFC2045` ### Why are the changes needed? 1. Performace gain - http://java-performance.info/base64-encoding-and-decoding-performance/ - have done benchmarks against jdk8/11, shown 2-5x faster 2. reduce dependencies after we replace other related places ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? - Existing StringExpressionSuite - benchmarks Closes apache#35110 from yaooqinn/SPARK-37820. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Follow up #45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In #35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>
Follow up apache#45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In apache#35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 8d3d4f9)
Backports #47303 to 3.5 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In #35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47325 from wForget/SPARK-47307_3.5. Lead-authored-by: wforget <643348094@qq.com> Co-authored-by: Ted Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>
Follow up apache#45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In apache#35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>
Follow up apache#45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In apache#35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>
Follow up apache#45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In apache#35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>
What changes were proposed in this pull request?
Replace Base64 on ApacheCommonBase64 with native support (https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html) for Base-64 encode/decode.
ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt with url-safe off according to its doc, so we choose
java.util.Base64.Decoder#RFC2045
Why are the changes needed?
Does this PR introduce any user-facing change?
NO
How was this patch tested?