
[SPARK-48658][SQL] Encode/Decode functions report coding errors instead of mojibake for unmappable characters #47017


Closed
wants to merge 12 commits

Conversation

yaooqinn
Member

What changes were proposed in this pull request?

This PR makes the `encode`/`decode` functions report coding errors instead of producing mojibake for unmappable characters. Take `select encode('渭城朝雨浥轻尘', 'US-ASCII')` as an example.

Before this PR,

```sql
???????
```

After this PR,

```json
org.apache.spark.SparkRuntimeException
{
  "errorClass" : "MALFORMED_CHARACTER_CODING",
  "sqlState" : "22000",
  "messageParameters" : {
    "charset" : "US-ASCII",
    "function" : "`encode`"
  }
}
```
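The before/after contrast can be sketched with Python's codec error handlers. This is an analogy only, not Spark's implementation (Spark uses the JVM's charset machinery): the legacy path silently replaces each unmappable character, while the new path reports an error.

```python
text = "渭城朝雨浥轻尘"

# Legacy-style behavior: every unmappable character is replaced with '?',
# producing the mojibake shown above.
mojibake = text.encode("US-ASCII", errors="replace")
print(mojibake)  # b'???????'

# New-style behavior: strict coding raises instead of corrupting the data.
try:
    text.encode("US-ASCII", errors="strict")
except UnicodeEncodeError as e:
    print(f"encode failed: {e.reason}")
```

The seven question marks are indistinguishable from literal `?` characters in the input, which is why reporting an error preserves data quality where replacement cannot.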

Why are the changes needed?

Improve data quality.

Does this PR introduce any user-facing change?

Yes.

When `spark.sql.legacy.codingErrorAction` is set to `true`, the `encode`/`decode` functions replace unmappable characters with mojibake instead of reporting coding errors, restoring the old behavior.
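The same trade-off applies on the `decode` side, where malformed byte sequences either become U+FFFD replacement characters (legacy) or raise an error (new default). Again as a Python analogy, not Spark's actual code path:

```python
# 0xFF is never a valid byte in UTF-8, so this input is malformed.
data = b"\xff\xfe"

# Legacy-style behavior: malformed bytes silently become U+FFFD mojibake.
print(data.decode("UTF-8", errors="replace"))  # '\ufffd\ufffd'

# New-style behavior: a coding error is reported instead.
try:
    data.decode("UTF-8", errors="strict")
except UnicodeDecodeError as e:
    print(f"decode failed at byte offset {e.start}")  # offset 0
```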

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 19, 2024
@yaooqinn
Member Author

Hi @cloud-fan, I have addressed your comments. The expressions are now replaced at runtime by static invoke, and the string representations no longer contain those legacy flags.

@yaooqinn yaooqinn requested a review from cloud-fan June 24, 2024 03:35
```diff
@@ -107,7 +107,7 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
strExpr = StringDecode(Encode(strExpr, "utf-8"), "utf-8")
```
Contributor

Can we use a different expression for testing? The generated code size is greatly reduced now that `Encode` uses `StaticInvoke`.

Contributor

e.g. `StringTrim`

Member Author

Nice catch!

"instead of reporting coding errors.")
.version("4.0.0")
.booleanConf
.createWithDefault(false)
Member

I wonder if it should be a fallback conf to ANSI.

Member Author

The reasons I'd like to make it independent of ANSI:

  • Part of the implication of ANSI mode is Hive incompatibility.
  • Hive also reports coding errors, so it was a mistake when we ported this behavior from Hive.
  • These functions are not defined by the ANSI standard.
  • The error behaviors are also not found in ANSI.

These reasons indicate that the mojibake behavior is a legacy trait of Spark itself.

@yaooqinn yaooqinn closed this in fb5697d Jun 24, 2024
@yaooqinn
Member Author

Merged to master.

Thank you @cloud-fan @HyukjinKwon for the help

@yaooqinn yaooqinn deleted the SPARK-48658 branch June 25, 2024 08:12
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…ad of mojibake for unmappable characters


Closes apache#47017 from yaooqinn/SPARK-48658.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>