
[SPARK-47307][SQL] Add a config to optionally chunk base64 strings #45408


Closed
wants to merge 16 commits

Conversation

ted-jenks
Contributor

@ted-jenks ted-jenks commented Mar 6, 2024

What changes were proposed in this pull request?

[SPARK-47307] Add a config to optionally chunk base64 strings

Why are the changes needed?

In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true, as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

whose documentation states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. As a result, base64 strings encoded by Spark can no longer be decoded by decoders that do not implement RFC 2045, even though the docs state RFC 4648.
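The difference is easy to reproduce with the JDK's `java.util.Base64` (a minimal illustration; the `ChunkDemo` class name is invented for this sketch):

```java
import java.util.Base64;

public class ChunkDemo {
    // RFC 4648 "basic" encoder: never inserts line separators.
    static String basic(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // RFC 2045 (MIME) encoder: inserts CRLF every 76 output characters.
    static String mime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[100]; // encodes to 136 base64 chars, longer than one 76-char line
        System.out.println(basic(data).contains("\r\n")); // false
        System.out.println(mime(data).contains("\r\n"));  // true
    }
}
```

Any decoder that implements only RFC 4648 will reject or mangle the CRLF-containing form, which is the correctness issue described below.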

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suite.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Mar 6, 2024
@ted-jenks
Contributor Author

@dongjoon-hyun, could you please take a look? This caused a significant data correctness issue for us.

@dongjoon-hyun
Member

Hi, @ted-jenks. Could you elaborate on your correctness issue a little more? It sounds like you have other systems reading Spark's data.

@ted-jenks
Contributor Author

ted-jenks commented Mar 7, 2024

@dongjoon-hyun

> It sounds like you have other systems to read Spark's data.

Correct. The issue is that between 3.2 and 3.3 there was a behavior change in the base64 encoding used in Spark. Previously, it did not chunk; now it does. Chunked base64 cannot be read by non-MIME-compatible base64 decoders, so data output by Spark appears corrupt to systems that follow the plain base64 standard.

I think the best path forward is to use MIME encoding/decoding without chunking, as this is the most fault-tolerant option: existing use cases will not break, and the pre-3.3 base64 behavior is restored.

@dongjoon-hyun
Member

dongjoon-hyun commented Mar 7, 2024

Thank you for the confirmation, @ted-jenks. Well, in this case, it's too late to change the behavior again. Apache Spark 3.3 has been EOL since last year, and I don't think we need to change the behavior for Apache Spark 3.4.3 and 3.5.2, because the Apache Spark community didn't have such an official contract before. It would have been great if you had participated in the Apache Spark 3.3.0 RC votes at that time.

> It sounds like you have other systems to read Spark's data.

> Correct. The issue is that between 3.2 and 3.3 there was a behavior change in the base64 encoding used in Spark. Previously, it did not chunk; now it does. Chunked base64 cannot be read by non-MIME-compatible base64 decoders, so data output by Spark appears corrupt to systems that follow the plain base64 standard.
>
> I think the best path forward is to use MIME encoding/decoding without chunking, as this is the most fault-tolerant option: existing use cases will not break, and the pre-3.3 base64 behavior is restored.

However, I understand and agree with @ted-jenks's point as a nice-to-have for Apache Spark 4+ officially. In other words, if we want to merge this PR, we need to make it official from Apache Spark 4.0.0 and protect it as a kind of developer interface for all future releases. Do you think that's okay, @ted-jenks?

BTW, what do you think about this proposal, @yaooqinn (the original author of #35110), @cloud-fan, and @HyukjinKwon?

@yaooqinn
Member

yaooqinn commented Mar 8, 2024

Thank you @dongjoon-hyun.

In such circumstances, I guess we can add a configuration for base64 classes to avoid breaking things again. AFAIK, Apache Hive also uses the JDK version, and I think the majority of Spark users talk to Hive heavily using Spark SQL.

@dongjoon-hyun
Member

+1 for the direction if we need to support both.

@yaooqinn
Member

yaooqinn commented Mar 8, 2024

As the Spark community didn't receive any issue reports during the v3.3.0 - v3.5.1 releases, I think this is a corner case. Maybe we can make the config internal.

@ted-jenks
Contributor Author

I think making this configurable makes the most sense. People processing data for external systems with Spark can then choose to chunk or not chunk the output based on the use case.
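At the encoder level, the proposed config could look roughly like this (a sketch only; the `chunkBase64` flag mirrors the variable name visible in the diff below, while the actual SQL config key is defined by the PR and not reproduced here):

```java
import java.util.Base64;

public class ConfiguredBase64 {
    // Hypothetical flag standing in for the proposed SQL config:
    // true -> chunked output (RFC 2045), false -> single-line output (RFC 4648).
    static byte[] encode(byte[] input, boolean chunkBase64) {
        Base64.Encoder encoder = chunkBase64
            ? Base64.getMimeEncoder()  // wraps at 76 chars with CRLF
            : Base64.getEncoder();     // no line breaks
        return encoder.encode(input);
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];
        System.out.println(new String(encode(data, true)).contains("\r\n"));  // true
        System.out.println(new String(encode(data, false)).contains("\r\n")); // false
    }
}
```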

```scala
${classOf[JBase64].getName}.getMimeEncoder().encode($child));
"""})
s"""
if ($chunkBase64) {
```
Contributor

We know the value of chunkBase64 before generating the Java code, so we can do better:

```scala
if (chunkBase64) {
  s""" ...
} else {
  s""" ...
}
```
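The suggestion is to branch in Scala at code-generation time rather than emitting a runtime `if` into the generated Java. Transposed to plain Java for illustration (the method and variable names here are hypothetical, not from the PR):

```java
public class CodegenSketch {
    // chunkBase64 is known before code generation, so pick the expression
    // string to emit up front instead of generating a runtime branch.
    static String genEncodeExpr(boolean chunkBase64, String childVar) {
        return chunkBase64
            ? "java.util.Base64.getMimeEncoder().encode(" + childVar + ")"
            : "java.util.Base64.getEncoder().encode(" + childVar + ")";
    }

    public static void main(String[] args) {
        System.out.println(genEncodeExpr(true, "input"));  // emits the MIME (chunked) call
        System.out.println(genEncodeExpr(false, "input")); // emits the basic (unchunked) call
    }
}
```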

Contributor Author

nice

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [SPARK-47307][SQL] Replace RFC 2045 base64 encoder with RFC 4648 encoder Mar 9, 2024
@ted-jenks ted-jenks changed the title [SPARK-47307][SQL] Replace RFC 2045 base64 encoder with RFC 4648 encoder [SPARK-47307][SQL] Add a config to optionally chunk base64 strings Mar 11, 2024
@yaooqinn
Member

Do we need to revise unbase64 accordingly?

@ted-jenks
Contributor Author

> Do we need to revise unbase64 accordingly?

Unbase64 uses the MIME decoder, which can tolerate both chunked and unchunked data.
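That tolerance is straightforward to check with the JDK decoders (a minimal sketch; `DecodeDemo` and `roundTrips` are invented names): `Base64.getMimeDecoder()` skips characters outside the base64 alphabet, so it round-trips both forms.

```java
import java.util.Arrays;
import java.util.Base64;

public class DecodeDemo {
    // True if the MIME decoder recovers the original bytes from both the
    // chunked (RFC 2045) and unchunked (RFC 4648) encodings of `data`.
    static boolean roundTrips(byte[] data) {
        String chunked = Base64.getMimeEncoder().encodeToString(data); // contains CRLFs
        String plain   = Base64.getEncoder().encodeToString(data);     // single line
        Base64.Decoder mime = Base64.getMimeDecoder();
        return Arrays.equals(mime.decode(chunked), data)
            && Arrays.equals(mime.decode(plain), data);
    }

    public static void main(String[] args) {
        System.out.println(roundTrips(new byte[100])); // true
    }
}
```

By contrast, the strict `Base64.getDecoder()` throws on the chunked form, which is exactly the failure external RFC 4648-only systems hit.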

@yaooqinn
Member

Thank you for the explanation, @ted-jenks.

@ted-jenks
Contributor Author

I am having trouble getting the failing test to pass:

```
13:27:04.051 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
```

giving:

```
java.sql.SQLException
[info]   org.apache.hive.service.cli.HiveSQLException: Error running query: [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] org.apache.spark.sql.AnalysisException: [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `base64` requires 0 parameters but the actual number is 1. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix. SQLSTATE: 42605; line 1 pos 7
[info]   	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation...
```

Any idea why I would get this?

@dongjoon-hyun
Member

Could you do the following to re-generate the golden files, @ted-jenks?

```shell
SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```

```
+- SubqueryAlias spark_catalog.default.char_tbl4
+- Project [staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, c7#x, 7, true, false, true) AS c7#x, staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, c8#x, 8, true, false, true) AS c8#x, v#x, s#x]
+- Relation spark_catalog.default.char_tbl4[c7#x,c8#x,v#x,s#x] parquet
org.apache.spark.sql.AnalysisException
```
Member

Ur, this seems to be a wrong regeneration due to the improper implementation.

```
TmV0RWEgIA== TmV0RWEgICA= U3Bhcmsg 78
TmV0RWFzIA== TmV0RWFzICA= U3Bhcms= 78
TmV0RWFzZQ== TmV0RWFzZSA= U3Bhcmst 78
org.apache.spark.sql.AnalysisException
```
Member

ditto. This seems to be a wrong regeneration due to the improper implementation.

@yaooqinn
Member

Gentle ping @ted-jenks, any updates for the test failures?

@dongjoon-hyun
Member

Gentle ping, @ted-jenks .

@yaooqinn
Member

We failed to get hold of @ted-jenks, so I'm pinging someone to take this over, if you don't mind.

@yaooqinn
Member

cc @wForget

@wForget
Member

wForget commented Jul 11, 2024

> Failed to get ahold of @ted-jenks, I'm pinging someone to take this over if you don't mind

Thank you @ted-jenks for this work. I have submitted a new PR to continue working on it. #47303

yaooqinn added a commit that referenced this pull request Jul 12, 2024
Follow up #45408

### What changes were proposed in this pull request?
[[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings

### Why are the changes needed?
In #35110, it was incorrectly asserted that:

> ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

```java
public static byte[] encodeBase64(byte[] binaryData)
```

Which states:

> Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. As a result, base64 strings encoded by Spark can no longer be decoded by decoders that do not implement RFC 2045, even though the docs state RFC 4648.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47303 from wForget/SPARK-47307.

Lead-authored-by: Ted Jenks <tedcj@palantir.com>
Co-authored-by: wforget <643348094@qq.com>
Co-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Ted Chester Jenks <tedcj@palantir.com>
Signed-off-by: Kent Yao <yao@apache.org>
@yaooqinn
Member

Closed via #47303

@yaooqinn yaooqinn closed this Jul 12, 2024
wForget pushed a commit to wForget/spark that referenced this pull request Jul 12, 2024
Follow up apache#45408


(cherry picked from commit 8d3d4f9)
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
Follow up apache#45408

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
Follow up apache#45408

himadripal pushed a commit to himadripal/spark that referenced this pull request Oct 19, 2024
Follow up apache#45408
