[SPARK-47307][SQL] Add a config to optionally chunk base64 strings #45408

ted-jenks · 2024-03-06T18:27:13Z

What changes were proposed in this pull request?

[SPARK-47307] Add a config to optionally chunk base64 strings

Why are the changes needed?

In #35110, it was incorrectly asserted that:

ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt

This is not true as the previous code called:

public static byte[] encodeBase64(byte[] binaryData)

Which states:

Encodes binary data using the base64 algorithm but does not chunk the output.

However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing test suite.

Was this patch authored or co-authored using generative AI tooling?

No

ted-jenks · 2024-03-06T18:36:40Z

@dongjoon-hyun please may you take a look. Caused a big data correctness issue for us.

dongjoon-hyun · 2024-03-06T21:35:48Z

Hi, @ted-jenks . Could you elaborate your correctness situation a little more? It sounds like you have other systems to read Spark's data.

ted-jenks · 2024-03-07T08:00:59Z

@dongjoon-hyun

It sounds like you have other systems to read Spark's data.

Correct. The issue was that from 3.2 to 3.3 there was a behavior change in the base64 encodings used in spark. Previously, they did not chunk. Now, they do. Chunked base64 cannot be read by non-MIME compatible base64 decoders causing the data output by Spark to be corrupt to systems following the normal base64 standard.

I think the best path forward is to use MIME encoding/decoding without chunking as this is the most fault tolerant meaning existing use-cases will not break, but the pre 3.3 base64 behavior is upheld.

dongjoon-hyun · 2024-03-07T20:53:40Z

Thank you for the confirmation, @ted-jenks . Well, in this case, it's too late to change the behavior again. Apache Spark 3.3 is already the EOL status since last year and I don't think we need to change the behavior for Apache Spark 3.4.3 and 3.5.2 because Apache Spark community didn't have such an official contract before. It would be great if you had participated the community at Apache Spark 3.3.0 RC votes at that time.

It sounds like you have other systems to read Spark's data.

Correct. The issue was that from 3.2 to 3.3 there was a behavior change in the base64 encodings used in spark. Previously, they did not chunk. Now, they do. Chunked base64 cannot be read by non-MIME compatible base64 decoders causing the data output by Spark to be corrupt to systems following the normal base64 standard.

I think the best path forward is to use MIME encoding/decoding without chunking as this is the most fault tolerant meaning existing use-cases will not break, but the pre 3.3 base64 behavior is upheld.

However, I understand and agree with @ted-jenks 's point as a nice-to-have of Apache Spark 4+ officially. In other words, if we want to merge this PR, we need to make it official from Apache Spark 4.0.0 and protect that as a kind of developer interface for all future releases. Do you think it's okay, @ted-jenks ?

BTW, how do you think about this proposal, @yaooqinn (the original author of #35110) and @cloud-fan and @HyukjinKwon ?

yaooqinn · 2024-03-08T02:14:13Z

Thank you @dongjoon-hyun.

In such circumstances, I guess we can add a configuration for base64 classes to avoid breaking things again. AFAIK, Apache Hive also uses the JDK version, and I think the majority of Spark users talk to Hive heavily using Spark SQL.

dongjoon-hyun · 2024-03-08T02:18:59Z

+1 for the direction if we need to support both.

yaooqinn · 2024-03-08T02:20:05Z

As the Spark Community didn't get any issue report during v3.3.0 - v3.5.1 releases, I think this is a corner case. Maybe we can make the config internal.

ted-jenks · 2024-03-08T10:17:09Z

I think making this configurable makes the most sense. For people processing data for external systems with Spark they can choose to chunk or not chunk data based on what the use-case is.

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

cloud-fan · 2024-03-08T14:57:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala

-            ${classOf[JBase64].getName}.getMimeEncoder().encode($child));
-       """})
+      s"""
+        if ($chunkBase64) {


We know the value of chunkBase64 before generating the java code, so we can do better

if (chunkBase64) { s""" ... } else { s""" ... }

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

yaooqinn · 2024-03-11T12:08:35Z

Do we need to revise unbase64 accordingly?

ted-jenks · 2024-03-11T12:52:09Z

Do we need to revise unbase64 accordingly?

Unbase64 uses the Mime decoder, which can tolerate chunked and unchunked data.

yaooqinn · 2024-03-11T13:05:22Z

thank you for the explanation @ted-jenks

ted-jenks · 2024-03-11T14:56:44Z

I am having trouble getting the failing test to pass:
13:27:04.051 ERROR org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite
Giving:

java.sql.SQLException
[info]   org.apache.hive.service.cli.HiveSQLException: Error running query: [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] org.apache.spark.sql.AnalysisException: [WRONG_NUM_ARGS.WITHOUT_SUGGESTION] The `base64` requires 0 parameters but the actual number is 1. Please, refer to 'https://spark.apache.org/docs/latest/sql-ref-functions.html' for a fix. SQLSTATE: 42605; line 1 pos 7
[info]   	at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:43)
[info]   	at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation...

Any idea why I would get this?

dongjoon-hyun · 2024-03-11T18:07:03Z

Could you do the following to re-generate the golden files, @ted-jenks ?

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"

dongjoon-hyun · 2024-03-12T20:13:42Z

sql/core/src/test/resources/sql-tests/analyzer-results/charvarchar.sql.out

-+- SubqueryAlias spark_catalog.default.char_tbl4
-   +- Project [staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, c7#x, 7, true, false, true) AS c7#x, staticinvoke(class org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils, StringType, readSidePadding, c8#x, 8, true, false, true) AS c8#x, v#x, s#x]
-      +- Relation spark_catalog.default.char_tbl4[c7#x,c8#x,v#x,s#x] parquet
+org.apache.spark.sql.AnalysisException


Ur, this seems to be a wrong regeneration due to the improper implementation.

dongjoon-hyun · 2024-03-12T20:13:59Z

sql/core/src/test/resources/sql-tests/results/charvarchar.sql.out

-TmV0RWEgIA==	TmV0RWEgICA=	U3Bhcmsg	78
-TmV0RWFzIA==	TmV0RWFzICA=	U3Bhcms=	78
-TmV0RWFzZQ==	TmV0RWFzZSA=	U3Bhcmst	78
+org.apache.spark.sql.AnalysisException


ditto. This seems to be a wrong regeneration due to the improper implementation.

yaooqinn · 2024-03-20T06:22:15Z

Gentle ping @ted-jenks, any updates for the test failures?

dongjoon-hyun · 2024-05-07T17:29:05Z

Gentle ping, @ted-jenks .

yaooqinn · 2024-07-11T06:41:55Z

Failed to get ahold of @ted-jenks, I'm pinging someone to take this over if you don't mind

yaooqinn · 2024-07-11T08:11:01Z

cc @wForget

wForget · 2024-07-11T13:14:33Z

Failed to get ahold of @ted-jenks, I'm pinging someone to take this over if you don't mind

Thank you @ted-jenks for this work. I have submitted a new PR to continue working on it. #47303

Follow up #45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In #35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes #47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>

yaooqinn · 2024-07-12T10:41:38Z

Closed via #47303

Follow up apache#45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In apache#35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org> (cherry picked from commit 8d3d4f9)

Follow up apache#45408 ### What changes were proposed in this pull request? [[SPARK-47307](https://issues.apache.org/jira/browse/SPARK-47307)] Add a config to optionally chunk base64 strings ### Why are the changes needed? In apache#35110, it was incorrectly asserted that: > ApacheCommonBase64 obeys http://www.ietf.org/rfc/rfc2045.txt This is not true as the previous code called: ```java public static byte[] encodeBase64(byte[] binaryData) ``` Which states: > Encodes binary data using the base64 algorithm but does not chunk the output. However, the RFC 2045 (MIME) base64 encoder does chunk by default. This now means that any Spark encoded base64 strings cannot be decoded by encoders that do not implement RFC 2045. The docs state RFC 4648. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#47303 from wForget/SPARK-47307. Lead-authored-by: Ted Jenks <tedcj@palantir.com> Co-authored-by: wforget <643348094@qq.com> Co-authored-by: Kent Yao <yao@apache.org> Co-authored-by: Ted Chester Jenks <tedcj@palantir.com> Signed-off-by: Kent Yao <yao@apache.org>

Ted Jenks added 2 commits March 6, 2024 18:19

move to RFC4648

15c0094

remove others

b2019e9

github-actions bot added the SQL label Mar 6, 2024

do decoders too

c2e9605

add test

81621b1

fix upstream

2bd8b89

Ted Jenks added 2 commits March 7, 2024 08:02

mime to decode

a36161a

scala array

4016eb3

make it configurable

109b514

cloud-fan reviewed Mar 8, 2024

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala Outdated Show resolved Hide resolved

Ted Jenks added 4 commits March 8, 2024 12:45

fix the codegen

946af79

reorg

0dadce5

undo bad diff

55b6bf2

try that

8c30869

cloud-fan reviewed Mar 8, 2024

View reviewed changes

do chunk if outside codegen

ac92c02

dongjoon-hyun reviewed Mar 9, 2024

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed Mar 9, 2024

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Outdated Show resolved Hide resolved

dongjoon-hyun reviewed Mar 9, 2024

View reviewed changes

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala Show resolved Hide resolved

dongjoon-hyun changed the title ~~[SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder~~ [SPARK-47307][SQL] Replace RFC 2045 base64 encoder with RFC 4648 encoder Mar 9, 2024

ted-jenks changed the title ~~[SPARK-47307][SQL] Replace RFC 2045 base64 encoder with RFC 4648 encoder~~ [SPARK-47307][SQL] Add a config to optionally chunk base64 strings Mar 11, 2024

comments

3f854f7

do that

ef05b34

to gcp

5435698

dongjoon-hyun reviewed Mar 12, 2024

View reviewed changes

wForget mentioned this pull request Jul 11, 2024

[SPARK-47307][SQL] Add a config to optionally chunk base64 strings #47303

Closed

yaooqinn closed this Jul 12, 2024

[SPARK-47307][SQL] Add a config to optionally chunk base64 strings #45408

[SPARK-47307][SQL] Add a config to optionally chunk base64 strings #45408

Uh oh!

Conversation

ted-jenks commented Mar 6, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

ted-jenks commented Mar 6, 2024

Uh oh!

dongjoon-hyun commented Mar 6, 2024

Uh oh!

ted-jenks commented Mar 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dongjoon-hyun commented Mar 7, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

yaooqinn commented Mar 8, 2024

Uh oh!

dongjoon-hyun commented Mar 8, 2024

Uh oh!

yaooqinn commented Mar 8, 2024

Uh oh!

ted-jenks commented Mar 8, 2024

Uh oh!

Uh oh!

cloud-fan Mar 8, 2024

Choose a reason for hiding this comment

Uh oh!

ted-jenks Mar 8, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yaooqinn commented Mar 11, 2024

Uh oh!

ted-jenks commented Mar 11, 2024

Uh oh!

yaooqinn commented Mar 11, 2024

Uh oh!

ted-jenks commented Mar 11, 2024

Uh oh!

dongjoon-hyun commented Mar 11, 2024

Uh oh!

dongjoon-hyun Mar 12, 2024

Choose a reason for hiding this comment

Uh oh!

dongjoon-hyun Mar 12, 2024

Choose a reason for hiding this comment

Uh oh!

yaooqinn commented Mar 20, 2024

Uh oh!

dongjoon-hyun commented May 7, 2024

Uh oh!

yaooqinn commented Jul 11, 2024

Uh oh!

yaooqinn commented Jul 11, 2024

Uh oh!

wForget commented Jul 11, 2024

Uh oh!

yaooqinn commented Jul 12, 2024

Uh oh!

Uh oh!

ted-jenks commented Mar 6, 2024 •

edited

Loading

ted-jenks commented Mar 7, 2024 •

edited

Loading

dongjoon-hyun commented Mar 7, 2024 •

edited

Loading