
[SPARK-48498][SQL][3.5] Always do char padding in predicates #46958


Conversation

jackylee-ch (Contributor)

What changes were proposed in this pull request?

For some data sources, CHAR type padding is applied on neither the write side nor the read side (read-side padding is disabled via `spark.sql.readSideCharPadding`), following a different SQL flavor that is similar to MySQL's: https://dev.mysql.com/doc/refman/8.0/en/char.html

However, Spark has a bug: when comparing a CHAR type column with a STRING literal, it always pads the string literal, which assumes the CHAR column is padded on either the write side or the read side. This is not always true.

This PR makes Spark always pad the CHAR type column when comparing it with a string literal, so the comparison satisfies CHAR type semantics.
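For illustration, a hedged reproduction sketch of the scenario this fix addresses. The table name, path, and data below are illustrative and not taken from the PR; the point is that the physical data is written as plain STRING (so it is not padded on the write side) and read-side padding is disabled:

```scala
// Hedged reproduction sketch (table name, path, and data are illustrative, not from the PR).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("char-padding-demo").getOrCreate()
import spark.implicits._

// Write unpadded STRING data, then expose it through a CHAR(5) schema.
val path = "/tmp/char_padding_demo"
Seq("abc").toDF("c").write.mode("overwrite").parquet(path)

spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.sql(s"CREATE TABLE t (c CHAR(5)) USING parquet LOCATION '$path'")

// Before this fix: only the literal was padded ('abc' -> 'abc  '), so the predicate
// effectively compared 'abc' = 'abc  ' and returned no rows.
// After this fix: the CHAR column is padded as well, so the row matches.
spark.sql("SELECT * FROM t WHERE c = 'abc'").show()
```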

Why are the changes needed?

Bug fix for users who disable read-side char padding.

Does this PR introduce any user-facing change?

Yes. After this PR, comparing a CHAR type column with a STRING literal follows CHAR semantics, whereas before it mostly returned false.

How was this patch tested?

New tests.

Was this patch authored or co-authored using generative AI tooling?

No.


Closes apache#46832 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
github-actions bot added the SQL label Jun 12, 2024
.internal()
.doc("When true, Spark will not apply char type padding for CHAR type columns in string " +
s"comparison predicates, when '${READ_SIDE_CHAR_PADDING.key}' is false.")
.version("4.0.0")
jackylee-ch (Contributor Author)
@cloud-fan Any idea how we should deal with the starting version for this config, given that we are backporting it to 3.5?

Contributor

I think we can change it to 3.5.2 here
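For reference, a hedged usage sketch of the escape hatch discussed above. The config key `spark.sql.legacy.noCharPaddingInPredicate` is an assumption inferred from the snippet and the upstream change; verify it against the released SQLConf documentation.

```scala
// Hedged sketch: opting back into the old (pre-fix) behavior via the legacy flag.
// Both keys are assumptions based on the diff quoted above; double-check them in SQLConf.
spark.conf.set("spark.sql.readSideCharPadding", "false")
spark.conf.set("spark.sql.legacy.noCharPaddingInPredicate", "true")

// With both set, CHAR columns are no longer padded in comparison predicates,
// so a query like `SELECT * FROM t WHERE c = 'abc'` on unpadded CHAR(5) data
// can again return no rows (the behavior this PR fixes by default).
```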


test("SPARK-48498: always do char padding in predicates") {
import testImplicits._
withSQLConf(SQLConf.READ_SIDE_CHAR_PADDING.key -> "false") {

Could you add a test for the conf LEGACY_NO_CHAR_PADDING_IN_PREDICATE?
It might fail when LEGACY_NO_CHAR_PADDING_IN_PREDICATE = true:
https://github.com/apache/spark/pull/46832/files?diff=split&w=1#r1642368622

Contributor

It will fail, as the legacy behavior is wrong. That is why we are making this fix...


got it, thanks~
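For completeness, a hedged sketch of what a test exercising the legacy conf might look like; as noted above it would assert the old, incorrect result, which is presumably why it was not added. `SQLConf.LEGACY_NO_CHAR_PADDING_IN_PREDICATE` and the test utilities mirror the snippet quoted earlier; treat the details as assumptions rather than the PR's actual code.

```scala
// Hypothetical test sketch (not part of this PR): pinning down the legacy behavior.
test("legacy: no char padding in predicates") {
  import testImplicits._
  withSQLConf(
      SQLConf.READ_SIDE_CHAR_PADDING.key -> "false",
      SQLConf.LEGACY_NO_CHAR_PADDING_IN_PREDICATE.key -> "true") {
    withTempPath { dir =>
      withTable("t") {
        // Physical data is unpadded because it is written as a plain STRING column.
        Seq("12").toDF("c").write.parquet(dir.getCanonicalPath)
        sql(s"CREATE TABLE t (c CHAR(3)) USING parquet LOCATION '${dir.getCanonicalPath}'")
        // Under the legacy behavior only the literal is padded ('12' -> '12 '),
        // so the comparison misses the row -- i.e. the bug this PR fixes by default.
        checkAnswer(sql("SELECT * FROM t WHERE c = '12'"), Nil)
      }
    }
  }
}
```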

@cloud-fan (Contributor)

@jackylee-ch can you rebase and re-trigger the test?

@cloud-fan (Contributor)

3.5 is broken and will be fixed by #47022. Please rebase after the fix is merged.

@jackylee-ch (Contributor Author)

> 3.5 is broken and will be fixed by #47022. Please rebase after the fix is merged.

Done. It is ready to be merged now. @cloud-fan

@cloud-fan (Contributor)

thanks, merging to 3.5!

cloud-fan added a commit that referenced this pull request Jun 24, 2024

Closes #46958 from jackylee-ch/backport_char_padding_fix_to_3.5.

Lead-authored-by: jackylee-ch <lijunqing@baidu.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
cloud-fan closed this Jun 24, 2024