Implement native function of initcap #1549

@slfan1989

Summary

Add a native, vectorized initcap(str) that capitalizes the first letter of each word and lowercases the rest, matching Apache Spark semantics.

Motivation

  • Align with Spark SQL string functions to improve coverage.
  • Use an Arrow-native implementation to boost performance on large columns.

Scope

  • New scalar function: initcap(str)
  • Input types: STRING (Utf8, LargeUtf8)
  • Output type: STRING
  • Null semantics: initcap(NULL) → NULL
  • Empty string returns empty string

Expected Semantics (Spark-aligned)

  • Words are delimited by non-letter characters: a "word" is a maximal run of consecutive letters.
  • For each word: uppercase the first letter, lowercase the remaining letters.
  • Non-letter characters (spaces, punctuation, digits, underscores, hyphens, etc.) are preserved as-is and act as delimiters.
  • Unicode-friendly: letters include non-ASCII letters (accents, CJK, etc.).
  • Locale-independent: use Unicode case mapping; not affected by system locale.
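
The rules above can be sketched as a scalar reference function. This is only an illustration of the expected behavior in plain Python, not the proposed Arrow-native implementation. It assumes that "letter" means whatever Python's `str.isalpha()` accepts and that case mapping is Python's built-in Unicode mapping, so edge cases (combining marks, characters whose uppercase form expands to several characters) may differ from the final native kernel.

```python
from typing import Optional


def initcap(value: Optional[str]) -> Optional[str]:
    """Reference sketch of the semantics described in this issue.

    Assumption: a "letter" is any character for which str.isalpha() is True.
    """
    if value is None:
        # Null semantics: initcap(NULL) -> NULL
        return None

    out = []
    at_word_start = True  # the next letter seen starts a new word
    for ch in value:
        if ch.isalpha():
            # First letter of a word is uppercased, the rest are lowercased.
            out.append(ch.upper() if at_word_start else ch.lower())
            at_word_start = False
        else:
            # Non-letters are preserved as-is and act as delimiters.
            out.append(ch)
            at_word_start = True
    return "".join(out)


print(initcap("hELLO wORLD"))      # Hello World
print(initcap("foo_bar-baz 42x"))  # Foo_Bar-Baz 42X
print(initcap(""))                 # prints an empty line
```

A function like this could serve as a test oracle when comparing the native kernel's output against expected values, provided the letter definition above is confirmed to match the intended behavior.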
