Description
I think there are two distinct use cases for string lowercasing:
- to display a lowercased string to a user
- to manipulate strings in string algorithms (e.g. building a "case-insensitive" trie or another kind of index; having only a Unicode-aware case-insensitive comparison function is often not enough)
Currently the locale-unaware to_lowercase tries to do both, but doesn't do either one quite right. It isn't quite correct for the first case (it handles Greek, see #26035, but doesn't handle Turkish), and it is quirky, which makes it difficult to use safely in the second case.
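A small check of the behavior described above, using only the standard library's str::to_lowercase. The sample words are chosen here for illustration and do not come from the issue:

```rust
fn main() {
    // Greek: a word-final capital sigma is lowercased to the final form 'ς',
    // which is a context-dependent rule.
    assert_eq!("ΟΔΟΣ".to_lowercase(), "οδος");

    // Turkish: a Turkish reader expects dotless 'ı' (U+0131) here, but without
    // a locale the result is always the ASCII 'i'.
    assert_eq!("I".to_lowercase(), "i");
    assert_ne!("I".to_lowercase(), "ı");

    println!("ok");
}
```

So the function already pays the cost of context-sensitive rules for one language while remaining wrong for another.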
Therefore I suggest splitting this function into two, e.g. to_locale_lowercase(locale) and to_partial_lowercase(): one that fully implements Unicode lowercasing (it requires a locale and is suitable for displaying strings to people), and another that is incorrect in many cases and shouldn't be displayed to users, but preserves the simple invariants of ASCII lowercasing that make it useful and safe for algorithms that need code-point-wise lowercasing.
The partial implementation should satisfy the following invariants for all valid strings a and b:
lower(a) == lower(upper(a)) // No ß/SS
lower(a) == lower(lower(a))
lower(a) == lower(b) <=> upper(a) == upper(b)
lower(a + b) == lower(a) + lower(b) // No Σ/σ/ς
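To make the invariants concrete, here is a rough sketch. It first shows two of them failing with today's str::to_lowercase and str::to_uppercase, then checks a possible code-point-wise lowercasing against a few samples. The body of to_partial_lowercase is an assumption for illustration, not a proposed implementation: it keeps any character whose lowercase mapping is not exactly one code point. The upper-related invariants would also need a matching partial uppercase, which is not sketched here.

```rust
/// Hypothetical code-point-wise lowercasing (name taken from the proposal,
/// body assumed): each char is mapped independently, and a char whose
/// lowercase form expands to several code points is left unchanged.
fn to_partial_lowercase(s: &str) -> String {
    s.chars()
        .map(|c| {
            let mut it = c.to_lowercase();
            match (it.next(), it.next()) {
                (Some(lower), None) => lower, // exactly one code point
                _ => c,                       // multi-code-point mapping: keep as is
            }
        })
        .collect()
}

fn main() {
    // Current behavior: lower(a) != lower(upper(a)), because "ß" uppercases to "SS".
    assert_eq!("ß".to_uppercase(), "SS");
    assert_ne!("ß".to_lowercase(), "ß".to_uppercase().to_lowercase());

    // Current behavior: lower(a + b) != lower(a) + lower(b), because the final
    // sigma rule depends on neighbouring characters.
    let (a, b) = ("ΑΣ", "Α");
    let joined = [a, b].concat();
    assert_eq!(joined.to_lowercase(), "ασα");
    assert_eq!(a.to_lowercase() + &b.to_lowercase(), "αςα");

    // The sketch maps each char on its own, so concatenation is preserved...
    assert_eq!(
        to_partial_lowercase(&joined),
        to_partial_lowercase(a) + &to_partial_lowercase(b)
    );
    // ...and lowercasing twice gives the same result on these samples.
    let once = to_partial_lowercase("HeLLo ΑΣ");
    assert_eq!(to_partial_lowercase(&once), once);

    println!("ok");
}
```

Because the sketch never looks at neighbouring characters, lower(a + b) == lower(a) + lower(b) holds by construction; the other invariants are only spot-checked here, not proven.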