
[SPARK-48658][SQL] Encode/Decode functions report coding errors instead of mojibake for unmappable characters #47017


Closed
wants to merge 12 commits

Conversation

yaooqinn
Member

What changes were proposed in this pull request?

This PR makes the `encode`/`decode` functions report coding errors instead of producing mojibake for unmappable characters. Take `select encode('渭城朝雨浥轻尘', 'US-ASCII')` as an example.

Before this PR,

```sql
???????
```

After this PR,

```json
org.apache.spark.SparkRuntimeException
{
  "errorClass" : "MALFORMED_CHARACTER_CODING",
  "sqlState" : "22000",
  "messageParameters" : {
    "charset" : "US-ASCII",
    "function" : "`encode`"
  }
}
```
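The before/after contrast can be sketched with Python's codec error handlers. This is an analogy only, not Spark's implementation (Spark uses the JVM's charset machinery): the legacy path silently replaces each unmappable character, while the new path reports an error.

```python
text = "渭城朝雨浥轻尘"

# Legacy-style behavior: every unmappable character is replaced with '?',
# producing the mojibake shown above.
mojibake = text.encode("US-ASCII", errors="replace")
print(mojibake)  # b'???????'

# New-style behavior: strict coding raises instead of corrupting the data.
try:
    text.encode("US-ASCII", errors="strict")
except UnicodeEncodeError as e:
    print(f"encode failed: {e.reason}")
```

The seven question marks are indistinguishable from literal `?` characters in the input, which is why reporting an error preserves data quality where replacement cannot.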

Why are the changes needed?

Improve data quality.

Does this PR introduce any user-facing change?

Yes.

When `spark.sql.legacy.codingErrorAction` is set to `true`, the `encode`/`decode` functions replace unmappable characters with mojibake instead of reporting coding errors, restoring the old behavior.
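The same trade-off applies on the `decode` side, where malformed byte sequences either become U+FFFD replacement characters (legacy) or raise an error (new default). Again as a Python analogy, not Spark's actual code path:

```python
# 0xFF is never a valid byte in UTF-8, so this input is malformed.
data = b"\xff\xfe"

# Legacy-style behavior: malformed bytes silently become U+FFFD mojibake.
print(data.decode("UTF-8", errors="replace"))  # '\ufffd\ufffd'

# New-style behavior: a coding error is reported instead.
try:
    data.decode("UTF-8", errors="strict")
except UnicodeDecodeError as e:
    print(f"decode failed at byte offset {e.start}")  # offset 0
```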

How was this patch tested?

New unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Jun 19, 2024
@yaooqinn
Member Author

Hi @cloud-fan, I have addressed your comments. The expressions are now replaced at runtime by static invoke, and the string representations no longer contain those legacy flags.

@yaooqinn yaooqinn requested a review from cloud-fan June 24, 2024 03:35
```diff
@@ -107,7 +107,7 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
strExpr = StringDecode(Encode(strExpr, "utf-8"), "utf-8")
```
Contributor

Can we use a different expression for testing? The generated code size is greatly reduced now that `Encode` uses `StaticInvoke`.

Contributor

e.g. `StringTrim`

Member Author

Nice catch!

"instead of reporting coding errors.")
.version("4.0.0")
.booleanConf
.createWithDefault(false)
Member

I wonder if it should be a fallback conf to ANSI.

Member Author

The reasons I'd like to make it independent of ANSI:

  • Part of the implication of ANSI mode is Hive incompatibility.
  • Hive also reports coding errors, so it was a mistake when we ported this behavior from Hive.
  • These functions are not defined by the ANSI standard.
  • The error behaviors are also not found in ANSI.

These reasons indicate that the mojibake behavior is a legacy trait of Spark itself.

@yaooqinn yaooqinn closed this in fb5697d Jun 24, 2024
@yaooqinn
Member Author

Merged to master.

Thank you @cloud-fan @HyukjinKwon for the help

@yaooqinn yaooqinn deleted the SPARK-48658 branch June 25, 2024 08:12
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…ad of mojibake for unmappable characters


Closes apache#47017 from yaooqinn/SPARK-48658.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>