-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-48498][SQL] Always do char padding in predicates #46832
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
cc @yaooqinn |
@@ -4603,6 +4603,14 @@ object SQLConf { | |||
.booleanConf | |||
.createWithDefault(true) | |||
|
|||
val LEGACY_NO_CHAR_PADDING_IN_PREDICATE = buildConf("spark.sql.legacy.noCharPaddingInPredicate") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because the legacy behavior is always padding, how about LEGACY_CHAR_PADDING_IN_PREDICATE
and change the default value to true ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
legacy behavior is no padding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, there is a bug in Spark that we always pad the string literal when comparing CHAR type and STRING literals, which assumes the CHAR type columns are always padded, either on the write side or read side.
I'm a bit confused.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me make it clear. It was talking about when people disabling spark.sql.readSideCharPadding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thank you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
.internal() | ||
.doc("When true, Spark will not apply char type padding for CHAR type columns in string " + | ||
s"comparison predicates, when '${READ_SIDE_CHAR_PADDING.key}' is false.") | ||
.version("4.0.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QQ: Is it needed by 3.5 or earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally yes, but it's only a problem if people turn off READ_SIDE_CHAR_PADDING, so not a bug by default, and I don't have a strong preference on backporting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We met SPARK-48562 in Spark 3.5, which is a bug for ApplyCharTypePadding
while writing JDBC temporary views, it lead us turn off READ_SIDE_CHAR_PADDING while creating JDBC temp views. Thus, we need backport this pr or filx the SPARK-48562 in 3.5.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Feel free to create a backport PR and I can help merge it.
thanks for review, merging to master! |
### What changes were proposed in this pull request? For some data sources, CHAR type padding is not applied on both the write and read sides (by disabling `spark.sql.readSideCharPadding`), as a different SQL flavor, which is similar to MySQL: https://dev.mysql.com/doc/refman/8.0/en/char.html However, there is a bug in Spark that we always pad the string literal when comparing CHAR type and STRING literals, which assumes the CHAR type columns are always padded, either on the write side or read side. This is not always true. This PR makes Spark always pad the CHAR type columns when comparing with string literals, to satisfy the CHAR type semantic. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? Yes. After this PR, comparing CHAR type with STRING literals follows the CHAR semantic, while before it mostly returns false. ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#46832 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
|
||
test("SPARK-48498: always do char padding in predicates") { | ||
import testImplicits._ | ||
withSQLConf(SQLConf.READ_SIDE_CHAR_PADDING.key -> "false") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If set key LEGACY_NO_CHAR_PADDING_IN_PREDICATE
= true, this case will fail.
if not set this key to true, maybe can not get data when char column compare to liter, because liter would be padded but char column not, any idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LEGACY_NO_CHAR_PADDING_IN_PREDICATE
means no padding, and this test will fail if this config is true as this test wants padding.
### What changes were proposed in this pull request? This is a followup of #46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? No because it's a followup and the original PR is not released yet ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47412 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This is a followup of #46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? No because it's a followup and the original PR is not released yet ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes #47412 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This is a followup of apache#46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? No because it's a followup and the original PR is not released yet ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#47412 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This is a followup of apache#46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? No because it's a followup and the original PR is not released yet ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#47412 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request? This is a followup of apache#46832 to handle a missing case: char-char comparison. We should pad both sides if `READ_SIDE_CHAR_PADDING` is not enabled. ### Why are the changes needed? bug fix if people disable read side char padding ### Does this PR introduce _any_ user-facing change? No because it's a followup and the original PR is not released yet ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#47412 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
For some data sources, CHAR type padding is not applied on both the write and read sides (by disabling
spark.sql.readSideCharPadding
), as a different SQL flavor, which is similar to MySQL: https://dev.mysql.com/doc/refman/8.0/en/char.htmlHowever, there is a bug in Spark that we always pad the string literal when comparing CHAR type and STRING literals, which assumes the CHAR type columns are always padded, either on the write side or read side. This is not always true.
This PR makes Spark always pad the CHAR type columns when comparing with string literals, to satisfy the CHAR type semantic.
Why are the changes needed?
bug fix if people disable read side char padding
Does this PR introduce any user-facing change?
Yes. After this PR, comparing CHAR type with STRING literals follows the CHAR semantic, while before it mostly returns false.
How was this patch tested?
new tests
Was this patch authored or co-authored using generative AI tooling?
no