Skip to content

[SPARK-37820][SQL] Replace ApacheCommonBase64 with JavaBase64 for string funcs #35110

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

yaooqinn
Copy link
Member

@yaooqinn yaooqinn commented Jan 6, 2022

What changes were proposed in this pull request?

Replace Base64 on ApacheCommonBase64 with native support (https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html) for Base-64 encode/decode.

ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt with url-safe off according to its doc, so we choose java.util.Base64.Decoder#RFC2045

Why are the changes needed?

  1. Performace gain
  1. reduce dependencies after we replace other related places

Does this PR introduce any user-facing change?

NO

How was this patch tested?

  • Existing StringExpressionSuite
  • benchmarks

@github-actions github-actions bot added the SQL label Jan 6, 2022
@yaooqinn
Copy link
Member Author

yaooqinn commented Jan 6, 2022

cc @cloud-fan @dongjoon-hyun @HyukjinKwon, thanks very much

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @yaooqinn . I have two requests.

  • Could you regenerate the benchmark result by using GitHub Action?
  • Could you add Java 17 benchmark result?

For the others, it looks good.

@yaooqinn
Copy link
Member Author

yaooqinn commented Jan 6, 2022

Could you regenerate the benchmark result by using GitHub Action?

Thanks, @dongjoon-hyun any instructions for triggering this step?

@dongjoon-hyun
Copy link
Member

dongjoon-hyun commented Jan 6, 2022

Yes, Here it is, Running benchmarks in your forked repository. It's easy and nice tool which @HyukjinKwon contributed.

@yaooqinn
Copy link
Member Author

yaooqinn commented Jan 6, 2022

I have found the guide, thanks

@yaooqinn
Copy link
Member Author

yaooqinn commented Jan 6, 2022

Extraordinary feature! @dongjoon-hyun @HyukjinKwon

Yes, Here it is, Running benchmarks in your forked repository. It's easy and nice tool which @HyukjinKwon contributed.

@dongjoon-hyun
Copy link
Member

Are the benchmark result ready, @yaooqinn ? :)

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.
Thank you, @yaooqinn , @cloud-fan , @HyukjinKwon .

@yaooqinn yaooqinn deleted the SPARK-37820 branch January 7, 2022 05:40
dchvn pushed a commit to dchvn/spark that referenced this pull request Jan 19, 2022
…ing funcs

### What changes were proposed in this pull request?

Replace Base64 on ApacheCommonBase64 with native support (https://docs.oracle.com/javase/8/docs/api/java/util/Base64.html)  for Base-64 encode/decode.

ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt with url-safe off according to its doc, so we choose `java.util.Base64.Decoder#RFC2045`

### Why are the changes needed?

1. Performace gain
  - http://java-performance.info/base64-encoding-and-decoding-performance/
  - have done benchmarks against jdk8/11, shown 2-5x faster

2. reduce dependencies after we replace other related places

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

- Existing StringExpressionSuite
- benchmarks

Closes apache#35110 from yaooqinn/SPARK-37820.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
yaooqinn added a commit that referenced this pull request Jul 12, 2024
Follow up #45408

### What changes were proposed in this pull request?
[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?
In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
wForget pushed a commit to wForget/spark that referenced this pull request Jul 12, 2024
Follow up apache#45408

### What changes were proposed in this pull request?
[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?
In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>

(cherry picked from commit 8d3d4f9)
yaooqinn pushed a commit that referenced this pull request Jul 16, 2024
Backports #47303 to 3.5

### What changes were proposed in this pull request?

[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?

In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?

 No

Closes #47325 from wForget/SPARK-47307_3.5.

Lead-authored-by: wforget <643348094@qq.com>
Co-authored-by: Ted Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
Follow up apache#45408

### What changes were proposed in this pull request?
[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?
In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
Follow up apache#45408

### What changes were proposed in this pull request?
[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?
In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
Follow up apache#45408

### What changes were proposed in this pull request?
[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?
In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants