Closed
Description
Summary
Add a native, vectorized initcap(str) that capitalizes the first letter of each word and lowercases the rest, matching Apache Spark semantics.
Motivation
- Align with Spark SQL string functions to improve coverage.
- Use an Arrow-native implementation to boost performance on large columns.
Scope
- New scalar function: initcap(str)
- Input types: STRING (Utf8, LargeUtf8)
- Output type: STRING
- Null semantics: initcap(NULL) → NULL
- Empty string returns empty string
Expected Semantics (Spark-aligned)
- Words are delimited by non-letter characters.
- For each word: uppercase the first letter, lowercase the remaining letters.
- Non-letter characters (spaces, punctuation, digits, underscores, hyphens, etc.) are preserved as-is and act as delimiters.
- Unicode-friendly: letters include non-ASCII letters (accents, CJK, etc.).
- Locale-independent: use Unicode case mapping; not affected by system locale.
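The semantics listed above can be sketched in plain Python as a reference for testing. This is only an illustration of the rules as stated in this issue, not the proposed Arrow-native implementation; the helper name `initcap_column` and the use of a Python list to stand in for a nullable string column are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def initcap(value: Optional[str]) -> Optional[str]:
    """Reference sketch of the semantics described in this issue."""
    if value is None:
        # Null semantics: initcap(NULL) -> NULL
        return None
    out = []
    at_word_start = True
    for ch in value:
        if ch.isalpha():
            # Unicode-aware, locale-independent case mapping.
            out.append(ch.upper() if at_word_start else ch.lower())
            at_word_start = False
        else:
            # Non-letters are preserved as-is and start a new word.
            out.append(ch)
            at_word_start = True
    return "".join(out)


def initcap_column(values: Sequence[Optional[str]]) -> List[Optional[str]]:
    # Hypothetical stand-in for a nullable string column (not an Arrow array).
    return [initcap(v) for v in values]


print(initcap_column(["hELLO wORLD", "abc123def", "", None]))
# → ['Hello World', 'Abc123Def', '', None]
```

Python's `str.upper` applies full Unicode case mapping, so a few characters can expand (for example `ß` becomes `SS`); whether the native implementation should behave the same way is left open here.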