[AURON #1680] initCap semantics are aligned with Spark #1681
Conversation
@slfan1989 @richox could you please take a look?

    ColumnarValue::Scalar(ScalarValue::Utf8(Some(str))) => {
        Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(initcap(str)))))
    }
    _ => df_execution_err!("string_initcap only supports literal utf8"),
how about "string_initcap only accepts string input"?
    ("select initcap(null)", Row(null))).foreach { case (q, expected) =>
      checkAnswer(sql(q), Seq(expected))

    withTable("initcap_basic_tbl") {
      sql(s"CREATE TABLE initcap_basic_tbl(id INT, txt STRING) USING parquet")
It would be better not to use a parquet table for testing: the test creates a directory on local disk, so a second run fails because the directory name is already taken. Using a temp view or something similar would be better.
Great suggestion. We'll fix this in a follow-up PR.
    ColumnarValue::Scalar(ScalarValue::Utf8(Some(str))) => {
        Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(initcap(str)))))
    }
    _ => df_execution_err!("string_initcap only supports literal utf8"),
Yes, obviously it supports both literal and non-literal string inputs.
good catch!
updated. PTAL
    sql(s"""
      |INSERT INTO initcap_mixed_tbl VALUES
      | (1, 'a1b2 c3D4'),
      | (2, '---abc--- ABC --ABC-- 世界 世 界 '),
Sorry about that! I think it would be better to remove the Chinese test cases here and use English instead, so other team members can more easily understand and verify the code during review. Thanks for understanding!
It is OK to use Chinese (or any other non-ASCII characters) because we have to test with real Unicode strings.
Thanks to @richox for the feedback. From the perspective of Unicode testing, I believe this is acceptable.
Force-pushed from 096fbcd to a4d206e.
Which issue does this PR close?
Closes #1680.
Rationale for this change
The current initcap implementation uses DataFusion's initcap, which does not match Spark's semantics. Spark uses space-only word boundaries and title-cases the first letter while lowercasing the rest.
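As a rough illustration of the Spark semantics described above, here is a minimal, self-contained Rust sketch. This is not the PR's actual implementation: the function name `initcap_spark` is hypothetical, and `char::to_uppercase` is used as an approximation of Unicode title-casing (the Rust standard library has no title-case conversion).

```rust
// Hypothetical sketch of Spark-aligned initcap semantics:
// only the ASCII space character starts a new word; the first
// character of each word is upper-cased (approximating title case)
// and every other character is lower-cased.
fn initcap_spark(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    let mut at_word_start = true;
    for ch in s.chars() {
        if ch == ' ' {
            // Spaces are the only word boundary; punctuation is not.
            out.push(ch);
            at_word_start = true;
        } else if at_word_start {
            out.extend(ch.to_uppercase());
            at_word_start = false;
        } else {
            out.extend(ch.to_lowercase());
        }
    }
    out
}

fn main() {
    assert_eq!(initcap_spark("a1b2 c3D4"), "A1b2 C3d4");
    // '-' is not a word boundary, so "abc" after dashes stays lowercase.
    assert_eq!(initcap_spark("---abc--- ABC"), "---abc--- Abc");
    assert_eq!(initcap_spark(""), "");
    println!("ok");
}
```

Note how this differs from DataFusion's `initcap`, which treats any non-alphanumeric character as a word boundary, so `---abc---` would become `---Abc---` there but stays `---abc---` under Spark's rules.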
What changes are included in this PR?
The native initcap implementation now follows Spark's semantics, equivalent to string.asInstanceOf[UTF8String].toLowerCase.toTitleCase.
Are there any user-facing changes?
Yes. initcap results will now match Spark's semantics.
How was this patch tested?
Added unit tests covering ASCII/non-ASCII, punctuation, space-only boundaries, and edge cases.