[AURON #1680] initCap semantics are aligned with Spark #1681

yew1eb · 2025-12-01T00:34:17Z

Which issue does this PR close?

Closes #1680.

Rationale for this change

The current initcap implementation uses DataFusion's initcap, which does not match Spark's semantics. Spark uses space-only word boundaries and title-cases the first letter while lowercasing the rest.

What changes are included in this PR?

Implement a new initcap native function aligned with Spark, similar to Spark's implementation logic: string.asInstanceOf[UTF8String].toLowerCase.toTitleCase.
Refactor and expand initcap unit tests, adding corner cases.
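
For illustration, the lowercase-then-title-case behavior described above can be sketched roughly as follows. This is not the code from this PR: the function name `initcap_sketch` is hypothetical, only the ASCII space is treated as a word boundary, and Rust's `char::to_uppercase` stands in for true title-casing, so a few special characters (for example `ß` or digraphs such as `ǆ`) may differ from Spark's `UTF8String` results.

```rust
/// Hypothetical sketch of Spark-style initcap (not the PR's implementation).
/// The first character of the string, and every character that follows a
/// space, is uppercased; every other character is lowercased.
fn initcap_sketch(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    // True at the start of the string and right after each ' '.
    let mut at_word_start = true;
    for ch in input.chars() {
        if at_word_start {
            // Assumption: uppercase approximates title case here.
            out.extend(ch.to_uppercase());
        } else {
            out.extend(ch.to_lowercase());
        }
        // Only a space opens a new word; punctuation such as '-' does not.
        at_word_start = ch == ' ';
    }
    out
}
```

Under these assumptions, `initcap_sketch("--ABC-- hello")` leaves the leading punctuation alone and lowercases `ABC`, because `-` is not treated as a boundary, while `hello` after the space becomes `Hello`.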

Are there any user-facing changes?

Yes. initcap results will now match Spark's semantics.

How was this patch tested?

Added unit tests covering ASCII/non-ASCII, punctuation, space-only boundaries, and edge cases.

yew1eb · 2025-12-01T03:43:45Z

@slfan1989 @richox could you please take a look?

xumingming · 2025-12-01T10:54:22Z

native-engine/datafusion-ext-functions/src/spark_initcap.rs

+        ColumnarValue::Scalar(ScalarValue::Utf8(Some(str))) => {
+            Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(initcap(str)))))
+        }
+        _ => df_execution_err!("string_initcap only supports literal utf8"),


how about "string_initcap only accepts string input"?

xumingming · 2025-12-01T11:00:49Z

spark-extension-shims-spark/src/test/scala/org.apache.auron/AuronQuerySuite.scala

-      ("select initcap(null)", Row(null))).foreach { case (q, expected) =>
-      checkAnswer(sql(q), Seq(expected))
+    withTable("initcap_basic_tbl") {
+      sql(s"CREATE TABLE initcap_basic_tbl(id INT, txt STRING) USING parquet")


It would be better not to use a Parquet table for testing. The test creates a directory on local disk, so it would fail on a second run because the directory name is already taken.

Using a temp view or something similar would be better.

Great suggestion. We'll fix this in a follow-up PR.

richox · 2025-12-04T11:55:04Z

native-engine/datafusion-ext-functions/src/spark_initcap.rs

+        ColumnarValue::Scalar(ScalarValue::Utf8(Some(str))) => {
+            Ok(ColumnarValue::Scalar(ScalarValue::Utf8(Some(initcap(str)))))
+        }
+        _ => df_execution_err!("string_initcap only supports literal utf8"),


yes, obviously it supports both literal and non-literal string inputs.

good catch!

updated. PTAL
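
To illustrate the point of this thread (the function handles both literal and column inputs, so the error message should not say "literal"), here is a minimal sketch. `StrValue` and `apply_str_fn` are hypothetical stand-ins, reduced to strings only; they are not DataFusion's `ColumnarValue` or the PR's actual code.

```rust
/// Hypothetical, string-only stand-in for a scalar-or-array value.
#[derive(Debug, PartialEq)]
enum StrValue {
    /// A single literal value; `None` models SQL NULL.
    Scalar(Option<String>),
    /// A column of values; `None` entries model SQL NULLs.
    Array(Vec<Option<String>>),
}

/// Applies a string function to either shape, passing NULLs through.
fn apply_str_fn(value: &StrValue, f: impl Fn(&str) -> String) -> StrValue {
    match value {
        StrValue::Scalar(s) => StrValue::Scalar(s.as_deref().map(&f)),
        StrValue::Array(items) => {
            StrValue::Array(items.iter().map(|s| s.as_deref().map(&f)).collect())
        }
    }
}
```

With both arms handled, an error branch would only be reached for non-string inputs, which is what the suggested wording "only accepts string input" conveys.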

slfan1989 · 2025-12-08T03:09:54Z

@yew1eb Thank you very much for flagging this issue. I'm +1 on this change.

cc: @richox

slfan1989 · 2025-12-08T03:12:02Z

spark-extension-shims-spark/src/test/scala/org.apache.auron/AuronQuerySuite.scala

+      sql(s"""
+           |INSERT INTO initcap_mixed_tbl VALUES
+           | (1, 'a1b2 c3D4'),
+           | (2, '---abc--- ABC --ABC-- 世界 世 界 '),


Sorry about that! I think it would be better to remove the Chinese test cases here and use English instead. This way, other team members can more easily understand and verify the code during review. Thanks for understanding!

It is OK to use Chinese (or any other non-ASCII characters) because we have to test with real Unicode strings.

Thanks to @richox for the feedback. From the perspective of Unicode testing, I believe this is acceptable.

github-actions bot added spark native labels Dec 1, 2025

xumingming reviewed Dec 1, 2025

richox reviewed Dec 4, 2025

slfan1989 reviewed Dec 8, 2025

yew1eb added 5 commits December 9, 2025 00:29

[AURON apache#1680] initCap semantics are aligned with Spark

c1f8340

up

a090957

up

ddacaec

up

88a745e

up

a4d206e

yew1eb force-pushed the impl_spark_initcap branch from 096fbcd to a4d206e Compare December 8, 2025 16:31

up

b3750ea

richox approved these changes Dec 15, 2025

slfan1989 approved these changes Dec 16, 2025

cxzl25 approved these changes Dec 16, 2025

cxzl25 merged commit eeac189 into apache:master Dec 16, 2025
98 checks passed
