Separate to_lowercase() into correct Unicode and simple implementations #26244

Closed
@kornelski

Description

I think there are two distinct use cases for string lowercasing:

  1. to display a lowercased string to a user
  2. to manipulate strings in string algorithms (e.g. building a "case-insensitive" trie or another kind of index; having only a Unicode-aware case-insensitive comparison function is often not enough).

Currently the locale-unaware to_lowercase tries to do both, but doesn't do either one quite right. It isn't quite correct for the first case (it handles Greek, see #26035, but doesn't handle Turkish), and its quirks make it difficult to use safely in the second case.
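To make the Turkish gap concrete, here is a minimal sketch of what locale-specific handling would have to add. `to_turkish_lowercase` is a hypothetical helper, not a std API, and it is deliberately simplified: it works per code point, so it ignores the Greek final-sigma context rule and the Turkish rule for `I` followed by a combining dot (U+0307).

```rust
/// Hypothetical helper (not part of std): lowercases `s` with the Turkish
/// mappings for dotted and dotless I, and the default mapping otherwise.
/// Simplified on purpose: per code point, no context rules.
fn to_turkish_lowercase(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            // U+0049 LATIN CAPITAL LETTER I -> U+0131 dotless i
            'I' => out.push('\u{131}'),
            // U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE -> plain i
            '\u{130}' => out.push('i'),
            // Everything else: the default (locale-unaware) mapping.
            _ => out.extend(c.to_lowercase()),
        }
    }
    out
}
```

For comparison, the locale-unaware `"ISPARTA".to_lowercase()` yields "isparta" with a dotted i, and lowercasing U+0130 yields "i" followed by U+0307, neither of which is what a Turkish reader expects.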

Therefore I suggest splitting this function into two, e.g. to_locale_lowercase(locale) and to_partial_lowercase(). The first would fully implement Unicode lowercasing; it requires a locale to be specified and is suitable for displaying strings to people. The second would be incorrect in many cases and shouldn't be displayed to users, but would preserve the simple invariants of ASCII lowercasing that make it useful and safe for algorithms that need code-point-wise lowercasing.
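A rough sketch of what the partial variant could look like, under stated assumptions: the function names are hypothetical, and because std exposes only the full case mappings through `char::to_lowercase`, this version simply leaves a character unchanged whenever its lowercase form is more than one code point. A real implementation would more likely use a curated table of simple mappings.

```rust
/// Hypothetical helper: lowercase one code point only when the result is
/// itself a single code point; otherwise return the input unchanged.
fn simple_lower_char(c: char) -> char {
    let mut mapped = c.to_lowercase();
    match (mapped.next(), mapped.next()) {
        (Some(lower), None) => lower,
        _ => c,
    }
}

/// Hypothetical `to_partial_lowercase`: context-free, one char in, one char
/// out, so the result never depends on neighbouring characters.
fn to_partial_lowercase(s: &str) -> String {
    s.chars().map(simple_lower_char).collect()
}
```

Because each character is mapped independently, lowercasing a concatenation gives the same result as concatenating the lowercased parts, and the number of code points is preserved. This sketch does not by itself guarantee the round-trip invariants involving uppercasing; those depend on which mappings the table includes.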

The partial implementation should satisfy the following invariants for all valid strings a and b:

lower(a) == lower(upper(a)) // No ß/SS
lower(a) == lower(lower(a))
lower(a) == lower(b) <=> upper(a) == upper(b)
lower(a + b) == lower(a) + lower(b) // No Σ/σ/ς
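As an illustration, the invariants can be written as small predicates over the existing `str::to_lowercase` and `str::to_uppercase`. The function names below are made up for this sketch; they only check a given input and prove nothing about all strings.

```rust
/// lower(a) == lower(upper(a))
fn round_trip_holds(a: &str) -> bool {
    a.to_lowercase() == a.to_uppercase().to_lowercase()
}

/// lower(a) == lower(lower(a))
fn idempotent_holds(a: &str) -> bool {
    a.to_lowercase() == a.to_lowercase().to_lowercase()
}

/// lower(a) == lower(b) <=> upper(a) == upper(b)
fn equivalence_holds(a: &str, b: &str) -> bool {
    (a.to_lowercase() == b.to_lowercase()) == (a.to_uppercase() == b.to_uppercase())
}

/// lower(a + b) == lower(a) + lower(b)
fn concat_holds(a: &str, b: &str) -> bool {
    let joined = format!("{}{}", a, b);
    joined.to_lowercase() == format!("{}{}", a.to_lowercase(), b.to_lowercase())
}
```

With the current std behaviour, `round_trip_holds("ß")` is false because "ß" uppercases to "SS", which lowercases to "ss". `equivalence_holds("ß", "ss")` is false for the same reason. `concat_holds("ΑΣ", "Α")` is false because the Σ is word-final in "ΑΣ" (giving ς) but not in "ΑΣΑ" (giving σ).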

Labels: C-feature-request (Category: A feature request, i.e: not implemented / a PR.), T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)
