[SPARK-48498][SQL] Always do char padding in predicates #46832

cloud-fan · 2024-06-01T00:32:43Z

What changes were proposed in this pull request?

Some data sources do not apply CHAR type padding on either the write side or the read side (the latter by disabling spark.sql.readSideCharPadding). This follows a different SQL flavor, similar to MySQL: https://dev.mysql.com/doc/refman/8.0/en/char.html

However, Spark has a bug: when comparing a CHAR type column with a STRING literal, it always pads the string literal, which assumes that CHAR type columns are always padded, either on the write side or on the read side. This is not always true.

This PR makes Spark always pad CHAR type columns when comparing them with string literals, to satisfy the CHAR type semantics.
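The mismatch described above can be sketched in plain Scala, without Spark. This is only an illustrative model, not the code in this PR: `rpad`, `compareLiteralPaddedOnly`, and `compareBothPadded` are hypothetical names, and the CHAR(3) value is assumed to be stored without any padding.

```scala
object CharPaddingSketch {
  // Right-pad `s` with spaces up to `length`; longer strings are returned unchanged.
  def rpad(s: String, length: Int): String = s.padTo(length, ' ')

  // Models the behavior described as the bug: only the literal is padded,
  // on the assumption that the stored column value is already padded.
  def compareLiteralPaddedOnly(stored: String, literal: String, charLength: Int): Boolean =
    stored == rpad(literal, charLength)

  // Models the behavior described as the fix: the column value is padded too.
  def compareBothPadded(stored: String, literal: String, charLength: Int): Boolean =
    rpad(stored, charLength) == rpad(literal, charLength)

  def main(args: Array[String]): Unit = {
    val stored = "a" // hypothetical CHAR(3) value, stored without padding
    println(compareLiteralPaddedOnly(stored, "a", 3)) // false
    println(compareBothPadded(stored, "a", 3))        // true
  }
}
```

With an unpadded stored value, padding only the literal compares "a" against "a  " and fails, which matches the "mostly returns false" behavior mentioned in the description.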

Why are the changes needed?

Bug fix for the case where people disable read side char padding.

Does this PR introduce any user-facing change?

Yes. After this PR, comparing a CHAR type column with a STRING literal follows the CHAR semantics, whereas before it mostly returned false.

How was this patch tested?

new tests

Was this patch authored or co-authored using generative AI tooling?

no

cloud-fan · 2024-06-01T00:32:54Z

cc @yaooqinn

beliefer · 2024-06-03T07:10:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

@@ -4603,6 +4603,14 @@ object SQLConf {
    .booleanConf
    .createWithDefault(true)

+  val LEGACY_NO_CHAR_PADDING_IN_PREDICATE = buildConf("spark.sql.legacy.noCharPaddingInPredicate")


Since the legacy behavior is to always pad, how about naming it LEGACY_CHAR_PADDING_IN_PREDICATE and changing the default value to true?

The legacy behavior is no padding.

"However, there is a bug in Spark that we always pad the string literal when comparing CHAR type and STRING literals, which assumes the CHAR type columns are always padded, either on the write side or read side."
I'm a bit confused.

Let me clarify: that sentence is about the case where people disable spark.sql.readSideCharPadding.

Got it. Thank you.

beliefer

LGTM.

yaooqinn · 2024-06-05T08:26:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+    .internal()
+    .doc("When true, Spark will not apply char type padding for CHAR type columns in string " +
+      s"comparison predicates, when '${READ_SIDE_CHAR_PADDING.key}' is false.")
+    .version("4.0.0")


QQ: Is it needed by 3.5 or earlier?

Ideally yes, but it's only a problem if people turn off READ_SIDE_CHAR_PADDING, so it is not a bug by default, and I don't have a strong preference on backporting.

We hit SPARK-48562 in Spark 3.5, which is a bug in ApplyCharTypePadding while writing JDBC temporary views. It led us to turn off READ_SIDE_CHAR_PADDING while creating JDBC temp views. Thus, we need to either backport this PR or fix SPARK-48562 in 3.5.

I see. Feel free to create a backport PR and I can help merge it.

cloud-fan · 2024-06-05T20:00:48Z

Thanks for the review, merging to master!

### What changes were proposed in this pull request?

For some data sources, CHAR type padding is not applied on both the write and read sides (by disabling `spark.sql.readSideCharPadding`), as a different SQL flavor, which is similar to MySQL: https://dev.mysql.com/doc/refman/8.0/en/char.html

However, there is a bug in Spark that we always pad the string literal when comparing CHAR type and STRING literals, which assumes the CHAR type columns are always padded, either on the write side or read side. This is not always true.

This PR makes Spark always pad the CHAR type columns when comparing with string literals, to satisfy the CHAR type semantic.

### Why are the changes needed?

bug fix if people disable read side char padding

### Does this PR introduce _any_ user-facing change?

Yes. After this PR, comparing CHAR type with STRING literals follows the CHAR semantic, while before it mostly returns false.

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes apache#46832 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

unigof · 2024-06-17T08:04:26Z

sql/core/src/test/scala/org/apache/spark/sql/CharVarcharTestSuite.scala

+
+  test("SPARK-48498: always do char padding in predicates") {
+    import testImplicits._
+    withSQLConf(SQLConf.READ_SIDE_CHAR_PADDING.key -> "false") {


If LEGACY_NO_CHAR_PADDING_IN_PREDICATE is set to true, this test case will fail.

If this key is not set to true, we may not get any data when a char column is compared to a literal, because the literal would be padded but the char column would not. Any idea?

LEGACY_NO_CHAR_PADDING_IN_PREDICATE means no padding, and this test will fail if this config is true as this test wants padding.
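The interaction of the two configs discussed in this thread can be modeled with a small plain-Scala sketch. It is not Spark's implementation: `padColumnInPredicate` and `matches` are hypothetical helpers, the literal is assumed to always be padded (as the PR description states), and the stored value is assumed to carry no write-side padding.

```scala
object PredicatePaddingFlags {
  // Right-pad `s` with spaces up to `length`; longer strings are returned unchanged.
  def rpad(s: String, length: Int): String = s.padTo(length, ' ')

  // Hypothetical decision: pad the CHAR column inside the predicate only when
  // read-side padding is off and the legacy "no padding" flag is not set.
  def padColumnInPredicate(readSideCharPadding: Boolean,
                           legacyNoCharPaddingInPredicate: Boolean): Boolean =
    !readSideCharPadding && !legacyNoCharPaddingInPredicate

  // `stored` is the raw value in the data source (assumed unpadded on write).
  def matches(stored: String, literal: String, charLength: Int,
              readSideCharPadding: Boolean,
              legacyNoCharPaddingInPredicate: Boolean): Boolean = {
    // Read-side padding, when enabled, pads the value as it is read.
    val afterRead = if (readSideCharPadding) rpad(stored, charLength) else stored
    val column =
      if (padColumnInPredicate(readSideCharPadding, legacyNoCharPaddingInPredicate))
        rpad(afterRead, charLength)
      else afterRead
    // The literal is assumed to be padded in every configuration.
    column == rpad(literal, charLength)
  }

  def main(args: Array[String]): Unit = {
    println(matches("a", "a", 3, readSideCharPadding = false, legacyNoCharPaddingInPredicate = false)) // true
    println(matches("a", "a", 3, readSideCharPadding = false, legacyNoCharPaddingInPredicate = true))  // false
  }
}
```

Under this model, the test above (read-side padding disabled) only finds the row when the legacy flag is false, which is why setting the flag to true makes it fail.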

### What changes were proposed in this pull request?

This is a followup of #46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled.

### Why are the changes needed?

bug fix if people disable read side char padding

### Does this PR introduce _any_ user-facing change?

No because it's a followup and the original PR is not released yet

### How was this patch tested?

new tests

### Was this patch authored or co-authored using generative AI tooling?

no

Closes #47412 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
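As a rough illustration of the char-char case, here is a plain-Scala sketch that pads both sides before comparing. The names are hypothetical, and padding to the longer of the two declared lengths is an assumption made for this example; the followup only states that both sides should be padded.

```scala
object CharCharComparisonSketch {
  // Right-pad `s` with spaces up to `length`; longer strings are returned unchanged.
  def rpad(s: String, length: Int): String = s.padTo(length, ' ')

  // Assumption: both sides are padded to the longer of the two declared CHAR lengths.
  def charEquals(left: String, leftLength: Int, right: String, rightLength: Int): Boolean = {
    val target = math.max(leftLength, rightLength)
    rpad(left, target) == rpad(right, target)
  }

  def main(args: Array[String]): Unit = {
    val left = "a "  // hypothetical CHAR(2) value that was padded by its writer
    val right = "a"  // hypothetical CHAR(5) value stored without padding
    println(left == right)                     // false: raw comparison
    println(charEquals(left, 2, right, 5))     // true: both sides padded
  }
}
```

The raw comparison differs only in trailing spaces, so padding both sides makes the two values compare equal under the CHAR semantics.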

github-actions bot added the SQL label Jun 1, 2024

yaooqinn approved these changes Jun 1, 2024

always do char padding in predicates

69a912e

cloud-fan force-pushed the char branch from 5c25760 to 69a912e Compare June 2, 2024 19:11

beliefer reviewed Jun 3, 2024

beliefer approved these changes Jun 5, 2024

yaooqinn reviewed Jun 5, 2024

yaooqinn approved these changes Jun 5, 2024

cloud-fan closed this in 490a4b3 Jun 5, 2024

unigof reviewed Jun 17, 2024

cloud-fan mentioned this pull request Jul 19, 2024

[SPARK-48498][SQL][FOLLOWUP] do padding for char-char comparison #47412

Closed
