Skip to content

[SPARK-47307][SQL][3.5] Add a config to optionally chunk base64 strings #47325

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

wForget
Copy link
Member

@wForget wForget commented Jul 12, 2024

Backports #47303 to 3.5

What changes were proposed in this pull request?

[SPARK-47307] Add a config to optionally chunk base64 strings

Why are the changes needed?

In #35110, it was incorrectly asserted that:

ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

public static byte[] encodeBase64(byte[] binaryData)

Which states:

Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suite.

Was this patch authored or co-authored using generative AI tooling?

No

Follow up apache#45408

### What changes were proposed in this pull request?
[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?
In apache#35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes apache#47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>

(cherry picked from commit 8d3d4f9)
@github-actions github-actions bot added the SQL label Jul 12, 2024
@yaooqinn
Copy link
Member

[info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: base64(0x61616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161)
[info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
[info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:96)

can you check these test failures?

.doc("Whether to truncate string generated by the `Base64` function. When true, base64" +
" strings generated by the base64 function are chunked into lines of at most 76" +
" characters. When false, the base64 strings are not chunked.")
.version("3.5.2")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, if we start to backport SPARK-47307, it will go to Apache Spark 3.4.4 together, right? In that case, I'm curious if 3.5.2 is correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. Got it. I saw this comment from Apache Spark 3.5.2 release manager, @yaooqinn .

If then, are we going to update these values from master and branch-3.5 when we do the release of Apache Spark 3.4.4? I'm fine if we are going to do in that way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @dongjoon-hyun.

Not related to this PR, maybe we shall add multiple fixed version in this field, such as 3.4.4, 3.5.2, 4.0.0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically, it's possible, but it will enforce us to update all the existing configurations and documentations. So, we had better not to because it could be too much.

@dongjoon-hyun
Copy link
Member

cc @cloud-fan , too

@wForget
Copy link
Member Author

wForget commented Jul 13, 2024

[info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot evaluate expression: base64(0x61616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161616161)
[info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:92)
[info]   at org.apache.spark.SparkException$.internalError(SparkException.scala:96)

can you check these test failures?

Looks like it's missing fb5697d#diff-fcaa047af256f3afb055c3e5d466d5a0fe94851bd6cc8f96ac04673d52e1a321

Should we backport #47017 or introduce this change in this PR?

@yaooqinn
Copy link
Member

Should we backport #47017
+1 @wForget

@dongjoon-hyun
Copy link
Member

However, SPARK-48658 was merged as an improvement JIRA, @yaooqinn . Do you mean we need to convert it as a bug fix?

Screenshot 2024-07-14 at 12 45 23

@dongjoon-hyun
Copy link
Member

If we need to change the issue type, please comment on your initial PR to get a consensus.

@yaooqinn
Copy link
Member

yaooqinn commented Jul 15, 2024

Okay, based on the information provided by @dongjoon-hyun and the Policy of backporting bugfiexes, I think we shall only fix the test errors instead of backporting SPARK-48658

Copy link
Member

@yaooqinn yaooqinn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from my side. Let's wait a bit while to see if @dongjoon-hyun has any concerns.

@wForget Can you raise another PR to 'master' to add a migration guide for migrating 3.5.1 to 3.5.2?

@wForget
Copy link
Member Author

wForget commented Jul 15, 2024

@wForget Can you raise another PR to 'master' to add a migration guide for migrating 3.5.1 to 3.5.2?

Sure, I will do it later.

yaooqinn pushed a commit that referenced this pull request Jul 16, 2024
Backports #47303 to 3.5

### What changes were proposed in this pull request?

[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?

In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?

 No

Closes #47325 from wForget/SPARK-47307_3.5.

Lead-authored-by: wforget <643348094@qq.com>
Co-authored-by: Ted Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
@yaooqinn
Copy link
Member

Merged to branch 3.5 for 3.5.2, thanks @wForget

@yaooqinn yaooqinn closed this Jul 16, 2024
@dongjoon-hyun
Copy link
Member

Thank you, @wForget and @yaooqinn !

@Kimahriman
Copy link
Contributor

FYI this changed the nullability of base64 to always be nullable instead of depending on the input, this broke some of our stateful streams with a schema mismatch error when upgrading from 3.5.1 to 3.5.2.

@cloud-fan
Copy link
Contributor

@Kimahriman thanks for reporting! I'm fixing it at #47952

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants