
[SPARK-48498][SQL][3.5] Always do char padding in predicates #46958


Conversation

jackylee-ch (Contributor)

What changes were proposed in this pull request?

For some data sources, CHAR type padding is applied on neither the write side nor the read side (read-side padding is disabled via `spark.sql.readSideCharPadding`), following a different SQL flavor that is similar to MySQL's: https://dev.mysql.com/doc/refman/8.0/en/char.html

However, Spark has a bug: when comparing a CHAR type column with a STRING literal, it always pads the string literal, which assumes the CHAR column is padded on either the write side or the read side. This is not always true.

This PR makes Spark always pad the CHAR type column when comparing it with a string literal, so the comparison satisfies CHAR type semantics.
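For illustration, a hedged reproduction sketch of the scenario this fix addresses. The table name, path, and data below are illustrative and not taken from the PR; the point is that the physical data is written as plain STRING (so it is not padded on the write side) and read-side padding is disabled:

```scala
// Hedged reproduction sketch (table name, path, and data are illustrative, not from the PR).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("char-padding-demo").getOrCreate()
import spark.implicits._

// Write unpadded STRING data, then expose it through a CHAR(5) schema.
val path = "/tmp/char_padding_demo"
Seq("abc").toDF("c").write.mode("overwrite").parquet(path)

spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.sql(s"CREATE TABLE t (c CHAR(5)) USING parquet LOCATION '$path'")

// Before this fix: only the literal was padded ('abc' -> 'abc  '), so the predicate
// effectively compared 'abc' = 'abc  ' and returned no rows.
// After this fix: the CHAR column is padded as well, so the row matches.
spark.sql("SELECT * FROM t WHERE c = 'abc'").show()
```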

Why are the changes needed?

Bug fix for users who disable read-side char padding.

Does this PR introduce any user-facing change?

Yes. After this PR, comparing a CHAR type column with a STRING literal follows CHAR semantics, whereas before it mostly returned false.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.


Closes apache#46832 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
github-actions bot added the SQL label Jun 12, 2024
.internal()
.doc("When true, Spark will not apply char type padding for CHAR type columns in string " +
s"comparison predicates, when '${READ_SIDE_CHAR_PADDING.key}' is false.")
.version("4.0.0")
jackylee-ch (Contributor Author)
@cloud-fan Any idea how we should deal with the starting version for this config, given that we are backporting it to 3.5?

Contributor

I think we can change it to 3.5.2 here
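For reference, a hedged usage sketch of the escape hatch discussed above. The config key `spark.sql.legacy.noCharPaddingInPredicate` is an assumption inferred from the snippet and the upstream change; verify it against the released SQLConf documentation.

```scala
// Hedged sketch: opting back into the old (pre-fix) behavior via the legacy flag.
// Both keys are assumptions based on the diff quoted above; double-check them in SQLConf.
spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.conf.set("spark.sql.legacy.noCharPaddingInPredicate", "true")

// With both set, CHAR columns are no longer padded in comparison predicates,
// so a query like `SELECT * FROM t WHERE c = 'abc'` on unpadded CHAR(5) data
// can again return no rows (the behavior this PR fixes by default).
```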


test("SPARK-48498: always do char padding in predicates") {
import testImplicits._
withSQLConf(SQLConf.READ_SIDE_CHAR_PADDING.key -> "false") {

Could you add a test for the conf LEGACY_NO_CHAR_PADDING_IN_PREDICATE?
It might fail when LEGACY_NO_CHAR_PADDING_IN_PREDICATE = true:
https://github.com/apache/spark/pull/46832/files?diff=split&w=1#r1642368622

Contributor

It will fail, as the legacy behavior is wrong. That is why we are making this fix...


got it, thanks~
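For completeness, a hedged sketch of what a test exercising the legacy conf might look like; as noted above it would assert the old, incorrect result, which is presumably why it was not added. `SQLConf.LEGACY_NO_CHAR_PADDING_IN_PREDICATE` and the test utilities mirror the snippet quoted earlier; treat the details as assumptions rather than the PR's actual code.

```scala
// Hypothetical test sketch (not part of this PR): pinning down the legacy behavior.
test("legacy: no char padding in predicates") {
  import testImplicits._
  withSQLConf(
      SQLConf.READ_SIDE_CHAR_PADDING.key -> "false",
      SQLConf.LEGACY_NO_CHAR_PADDING_IN_PREDICATE.key -> "true") {
    withTempPath { dir =>
      withTable("t") {
        // Physical data is unpadded because it is written as a plain STRING column.
        Seq("12").toDF("c").write.parquet(dir.getCanonicalPath)
        sql(s"CREATE TABLE t (c CHAR(3)) USING parquet LOCATION '${dir.getCanonicalPath}'")
        // Under the legacy behavior only the literal is padded ('12' -> '12 '),
        // so the comparison misses the row -- i.e. the bug this PR fixes by default.
        checkAnswer(sql("SELECT * FROM t WHERE c = '12'"), Nil)
      }
    }
  }
}
```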

@cloud-fan (Contributor)

@jackylee-ch can you rebase and re-trigger the test?

@cloud-fan (Contributor)

3.5 is broken and will be fixed by #47022. Please rebase after the fix is merged.

@jackylee-ch (Contributor Author)

> 3.5 is broken and will be fixed by #47022. Please rebase after the fix is merged.

Done. It is ready to be merged now. @cloud-fan

@cloud-fan (Contributor)

thanks, merging to 3.5!

cloud-fan added a commit that referenced this pull request Jun 24, 2024

Closes #46958 from jackylee-ch/backport_char_padding_fix_to_3.5.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan closed this Jun 24, 2024