[SPARK-48658][SQL] Encode/Decode functions report coding errors instead of mojibake for unmappable characters #47017
Conversation
cc @cloud-fan @dongjoon-hyun @LuciferYang @HyukjinKwon thanks
connector/connect/common/src/test/resources/query-tests/explain-results/function_decode.explain
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala
Hi @cloud-fan, I have addressed your comments. The expressions are now replaced at runtime via StaticInvoke, and the string representations no longer contain those legacy flags.
@@ -107,7 +107,7 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
strExpr = StringDecode(Encode(strExpr, "utf-8"), "utf-8")
Can we use a different expression for testing? The codegen size is greatly decreased after using StaticInvoke in Encode.
e.g. StringTrim
Nice catch!
"instead of reporting coding errors.")
.version("4.0.0")
.booleanConf
.createWithDefault(false)
I wonder if it should be a fallback conf to ANSI.
The reasons I'd like to make it independent of ANSI are:
- Part of the implication of ANSI mode is Hive incompatibility, but Hive also reports coding errors, so it was a mistake when we ported this behavior from Hive.
- These functions are not ANSI-defined.
- The error behaviors are also not found in ANSI.

The reasons above indicate that this behavior is more of a legacy trait of Spark itself.
Merged to master. Thank you @cloud-fan @HyukjinKwon for the help.
Closes apache#47017 from yaooqinn/SPARK-48658. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Kent Yao <yao@apache.org>
What changes were proposed in this pull request?

This PR makes the encode/decode functions report coding errors instead of mojibake for unmappable characters. Take `select encode('渭城朝雨浥轻尘', 'US-ASCII')` as an example.

Before this PR:

```sql
???????
```

After this PR:

```json
org.apache.spark.SparkRuntimeException
{
  "errorClass" : "MALFORMED_CHARACTER_CODING",
  "sqlState" : "22000",
  "messageParameters" : {
    "charset" : "US-ASCII",
    "function" : "`encode`"
  }
}
```

Why are the changes needed?

Improve data quality.

Does this PR introduce any user-facing change?

Yes. When spark.sql.legacy.codingErrorAction is set to true, the encode/decode functions replace unmappable characters with mojibake instead of reporting coding errors.

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.
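The two behaviors discussed above correspond to the two standard charset error actions: silently replacing unmappable characters versus reporting an error. A minimal Python sketch (not Spark code; only the example string and charset are taken from the PR description) illustrating the distinction:

```python
text = "渭城朝雨浥轻尘"  # the 7-character example string from the PR description

# Legacy behavior (spark.sql.legacy.codingErrorAction=true): each
# unmappable character is silently replaced, producing mojibake.
mojibake = text.encode("ascii", errors="replace")
print(mojibake)  # b'???????'

# New default behavior: an unmappable character raises a coding error
# instead of being replaced.
try:
    text.encode("ascii", errors="strict")
except UnicodeEncodeError as err:
    print("coding error at position", err.start)
```

In Spark's implementation the same switch is expressed via Java NIO's CodingErrorAction (REPLACE for the legacy path, REPORT for the new default); the Python error handlers above are just the closest stdlib analogue.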