Skip to content

Conversation

@yew1eb
Copy link
Contributor

@yew1eb yew1eb commented Dec 1, 2025

Which issue does this PR close?

Closes #1680 .

Rationale for this change

The current initcap implementation uses DataFusion's initcap, which does not match Spark's semantics. Spark uses space-only word boundaries and title-cases the first letter while lowercasing the rest.

What changes are included in this PR?

  • Implement a new initcap native function aligned with Spark, similar to Spark's implementation logic: string.asInstanceOf[UTF8String].toLowerCase.toTitleCase.
  • Refactor and expand initcap unit tests, adding corner cases.

Are there any user-facing changes?

Yes. initcap results will now match Spark's semantics.

How was this patch tested?

Added unit tests covering ASCII/non-ASCII, punctuation, space-only boundaries, and edge cases.

@yew1eb
Copy link
Contributor Author

yew1eb commented Dec 1, 2025

@slfan1989 @richox could you please take a look?

ColumnarValue::Scalar(ScalarValue::Utf8(Some(str))) => {
Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(initcap(str)))))
}
_ => df_execution_err!("string_initcap only supports literal utf8"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about "string_initcap only accepts string input"?

("select initcap(null)", Row(null))).foreach { case (q, expected) =>
checkAnswer(sql(q), Seq(expected))
withTable("initcap_basic_tbl") {
sql(s"CREATE TABLE initcap_basic_tbl(id INT, txt STRING) USING parquet")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would better not to use parquet table to do testing, the test would fail on second run because it would create a directory on local disk, on second run, the directory name is already taken.

Use temp view or something similar would be better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great suggestion. We'll fix this in a follow-up PR.

ColumnarValue::Scalar(ScalarValue::Utf8(Some(str))) => {
Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(initcap(str)))))
}
_ => df_execution_err!("string_initcap only supports literal utf8"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, obviously it supports both literal and non-literal string inputs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated. PTAL

@slfan1989
Copy link
Contributor

@yew1eb Thank you very much for flagging this issue. I'm +1 on this change.

cc: @richox

sql(s"""
|INSERT INTO initcap_mixed_tbl VALUES
| (1, 'a1b2 c3D4'),
| (2, '---abc--- ABC --ABC-- 世界 世 界 '),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that! I think it would be better to remove the Chinese test cases here and use English instead. This way, other team members can more easily understand and verify the code during review. Thanks for understanding!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is ok to use chinese (or any other non-ascii characters) because we have to test with real unicode strings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to @richox for the feedback. From the perspective of Unicode testing, I believe this is acceptable.

@yew1eb yew1eb force-pushed the impl_spark_initcap branch from 096fbcd to a4d206e Compare December 8, 2025 16:31
@cxzl25 cxzl25 merged commit eeac189 into apache:master Dec 16, 2025
98 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

initCap behavior mismatches Spark semantics

5 participants