[SPARK-48498][SQL][3.5] Always do char padding in predicates #46958
### What changes were proposed in this pull request?

For some data sources, CHAR type padding is applied on neither the write side nor the read side (by disabling `spark.sql.readSideCharPadding`). This gives a different SQL flavor, similar to MySQL: https://dev.mysql.com/doc/refman/8.0/en/char.html

However, there is a bug in Spark: when comparing a CHAR type column with a STRING literal, Spark always pads the string literal, which assumes that CHAR type columns are always padded, either on the write side or on the read side. This is not always true.

This PR makes Spark always pad CHAR type columns when comparing them with string literals, to satisfy the CHAR type semantics.

### Why are the changes needed?

Bug fix for users who disable read-side char padding.

### Does this PR introduce _any_ user-facing change?

Yes. After this PR, comparing a CHAR type column with a STRING literal follows the CHAR semantics, whereas before it mostly returned false.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#46832 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
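The comparison behavior described in this PR can be illustrated with a small, Spark-free model. This is only a sketch under stated assumptions, not Spark's implementation: `CharPaddingModel`, `rpad`, `literalOnlyEquals` and `bothPaddedEquals` are hypothetical names, the column is assumed to be `CHAR(3)`, and the literal is assumed to be no longer than the declared length.

```scala
// Hypothetical model of CHAR(n) vs. STRING literal comparison; not Spark code.
object CharPaddingModel {
  // Right-pad with spaces up to the declared CHAR(n) length.
  // Values already at least n characters long are returned unchanged.
  def rpad(s: String, n: Int): String = s + " " * math.max(0, n - s.length)

  // Behavior described as the bug: only the literal is padded, so the
  // comparison is correct only if the stored value was already padded.
  def literalOnlyEquals(stored: String, literal: String, n: Int): Boolean =
    stored == rpad(literal, n)

  // Behavior described as the fix: the column value is padded as well,
  // so it no longer matters whether the stored value was padded.
  def bothPaddedEquals(stored: String, literal: String, n: Int): Boolean =
    rpad(stored, n) == rpad(literal, n)

  def main(args: Array[String]): Unit = {
    val stored = "12" // written without padding into a CHAR(3) column
    println(literalOnlyEquals(stored, "12", 3)) // false: "12" != "12 "
    println(bothPaddedEquals(stored, "12", 3))  // true: "12 " == "12 "
  }
}
```

In this model, a value that was stored already padded (`"12 "`) compares equal under both variants, which matches the description's point that the old behavior only breaks when neither write-side nor read-side padding took place.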
.internal()
.doc("When true, Spark will not apply char type padding for CHAR type columns in string " +
  s"comparison predicates, when '${READ_SIDE_CHAR_PADDING.key}' is false.")
.version("4.0.0")
@cloud-fan Any idea how we should handle the starting version for this config, since we are backporting it to 3.5?
I think we can change it to 3.5.2 here.
test("SPARK-48498: always do char padding in predicates") {
  import testImplicits._
  withSQLConf(SQLConf.READ_SIDE_CHAR_PADDING.key -> "false") {
Could you add a test for the conf `LEGACY_NO_CHAR_PADDING_IN_PREDICATE`? It would probably fail when `LEGACY_NO_CHAR_PADDING_IN_PREDICATE = true`:
https://github.com/apache/spark/pull/46832/files?diff=split&w=1#r1642368622
It will fail, as the legacy behavior is wrong. This is why we make this fix...
got it, thanks~
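The exchange above about `LEGACY_NO_CHAR_PADDING_IN_PREDICATE` can be sketched with a simplified model based only on the config's doc string in this diff. It is not Spark's actual rule: `LegacyFlagModel`, `columnInPredicate` and `equalsLiteral` are hypothetical names, and the model assumes the literal is no longer than the declared CHAR length.

```scala
// Hypothetical model of how the two flags interact; not Spark code.
object LegacyFlagModel {
  // Right-pad with spaces up to the declared CHAR(n) length.
  private def rpad(s: String, n: Int): String =
    s + " " * math.max(0, n - s.length)

  // Value of a CHAR(n) column as the predicate sees it in this model:
  // padded on read when read-side padding is on; otherwise padded in the
  // predicate unless the legacy flag disables that.
  def columnInPredicate(
      stored: String,
      n: Int,
      readSidePadding: Boolean,
      legacyNoPadding: Boolean): String =
    if (readSidePadding || !legacyNoPadding) rpad(stored, n) else stored

  // The string literal is padded in every case, as the PR description states.
  def equalsLiteral(
      stored: String,
      literal: String,
      n: Int,
      readSidePadding: Boolean,
      legacyNoPadding: Boolean): Boolean =
    columnInPredicate(stored, n, readSidePadding, legacyNoPadding) == rpad(literal, n)

  def main(args: Array[String]): Unit = {
    // Unpadded "12" stored in a CHAR(3) column, read-side padding disabled.
    println(equalsLiteral("12", "12", 3, readSidePadding = false, legacyNoPadding = false)) // true
    println(equalsLiteral("12", "12", 3, readSidePadding = false, legacyNoPadding = true))  // false
  }
}
```

Under this model, enabling the legacy flag reproduces the pre-fix result for unpadded stored values, which is consistent with the reply that a test asserting CHAR semantics would fail with the flag set to true.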
@jackylee-ch can you rebase and re-trigger the test?
…_padding_fix_to_3.5
The 3.5 branch is broken and will be fixed by #47022. Please rebase after the fix is merged.
…_padding_fix_to_3.5
Done. It is ready to be merged now. @cloud-fan
thanks, merging to 3.5!
Closes #46958 from jackylee-ch/backport_char_padding_fix_to_3.5.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>