Improved collation support for non-ascii case-insensitive text-match #567

@tobixen

RFC 4791 requires the server to support searching according to two collations: i;octet for binary matching and i;ascii-casemap for case-insensitive matching, with the latter being the default. In caldav v2.2.2 there is test code covering both "case sensitive" and "case insensitive" searches. The problem with i;ascii-casemap is that it only folds the ASCII letters a-z, so it fails to match naïve with NAÏVE, cliché with CLICHÉ, or smörgåsbord with SMÖRGÅSBORD, not to mention millions of words in non-English languages, entire non-Latin scripts, etc.
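
To make the mismatch concrete, here is a minimal sketch of the two comparison behaviours. The function names are made up for illustration and are not part of the caldav library; `str.casefold()` is only a stand-in for what a Unicode-aware collation would do, not an implementation of any specific RFC collation.

```python
def ascii_casemap(text: str) -> str:
    """Uppercase only the ASCII letters a-z, leaving everything else untouched.

    This mirrors the idea behind i;ascii-casemap; the real collation is
    defined on octets, but the comparison result is the same for this purpose.
    """
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )


def ascii_casemap_equal(a: str, b: str) -> bool:
    """Equality as an ASCII-only case-insensitive comparison would see it."""
    return ascii_casemap(a) == ascii_casemap(b)


def unicode_caseless_equal(a: str, b: str) -> bool:
    """Equality using Python's full Unicode case folding (an approximation)."""
    return a.casefold() == b.casefold()
```

With precomposed characters, `ascii_casemap_equal("naïve", "NAÏVE")` is false because ï and Ï are left as they are, while `unicode_caseless_equal("naïve", "NAÏVE")` is true.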

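RFC 5051 specifies an i;unicode-casemap collation (registered under the RFC 4790 collation framework), which may or may not be supported by the server. RFC 4791 section 7.5.1 says that it's possible to ask the server which collations it supports, through the CALDAV:supported-collation-set property.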
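
A rough sketch of what asking the server could look like at the XML level, using only the standard library. The helper names, the sample response and the href in it are invented for illustration; how the PROPFIND is actually sent (and how caldav would expose this) is left open.

```python
import xml.etree.ElementTree as ET

DAV = "DAV:"
CALDAV = "urn:ietf:params:xml:ns:caldav"


def build_propfind_body() -> bytes:
    """Build a PROPFIND body requesting CALDAV:supported-collation-set."""
    ET.register_namespace("D", DAV)
    ET.register_namespace("C", CALDAV)
    propfind = ET.Element(f"{{{DAV}}}propfind")
    prop = ET.SubElement(propfind, f"{{{DAV}}}prop")
    ET.SubElement(prop, f"{{{CALDAV}}}supported-collation-set")
    return ET.tostring(propfind, encoding="utf-8", xml_declaration=True)


def parse_supported_collations(response_xml: bytes) -> set[str]:
    """Collect every CALDAV:supported-collation value in a multistatus body."""
    root = ET.fromstring(response_xml)
    return {
        el.text.strip()
        for el in root.iter(f"{{{CALDAV}}}supported-collation")
        if el.text
    }


# Hypothetical response from a server that only offers the two mandatory
# collations; the href is a placeholder.
SAMPLE_RESPONSE = b"""<?xml version="1.0" encoding="utf-8"?>
<D:multistatus xmlns:D="DAV:" xmlns:C="urn:ietf:params:xml:ns:caldav">
  <D:response>
    <D:href>/calendars/example/</D:href>
    <D:propstat>
      <D:prop>
        <C:supported-collation-set>
          <C:supported-collation>i;ascii-casemap</C:supported-collation>
          <C:supported-collation>i;octet</C:supported-collation>
        </C:supported-collation-set>
      </D:prop>
      <D:status>HTTP/1.1 200 OK</D:status>
    </D:propstat>
  </D:response>
</D:multistatus>"""
```

Checking `"i;unicode-casemap" in parse_supported_collations(SAMPLE_RESPONSE)` would then tell the library whether a workaround is needed for this server.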

To resolve this issue:

  • Case-insensitive searches should work for non-ASCII characters on all servers that support it.
  • The library should detect non-ASCII characters and apply workarounds for servers that do not support case-insensitive matching on non-ASCII characters.
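
One possible shape for the two points above, as a sketch only. `MatchPlan`, `plan_text_match`, `build_text_match` and `client_side_filter` are hypothetical names, and the fallback chosen here (skip the server-side text filter and filter in the client) is just one of several conceivable workarounds. It also assumes that i;ascii-casemap is good enough whenever the search term itself is pure ASCII, which ignores a few Unicode edge cases.

```python
from dataclasses import dataclass
from typing import Optional
import xml.etree.ElementTree as ET

CALDAV = "urn:ietf:params:xml:ns:caldav"


@dataclass
class MatchPlan:
    collation: Optional[str]   # collation to send, or None to skip the server filter
    filter_client_side: bool   # True when results must be re-checked locally


def plan_text_match(needle: str, supported: set[str],
                    case_sensitive: bool = False) -> MatchPlan:
    """Pick a collation, falling back to client-side filtering if needed."""
    if case_sensitive:
        return MatchPlan("i;octet", False)
    if "i;unicode-casemap" in supported:
        return MatchPlan("i;unicode-casemap", False)
    if needle.isascii():
        return MatchPlan("i;ascii-casemap", False)
    # Non-ASCII needle and no Unicode-aware collation on the server.
    return MatchPlan(None, True)


def build_text_match(needle: str, collation: str) -> ET.Element:
    """Build a CALDAV:text-match element with an explicit collation."""
    element = ET.Element(f"{{{CALDAV}}}text-match", {"collation": collation})
    element.text = needle
    return element


def client_side_filter(needle: str, values: list[str]) -> list[str]:
    """Case-insensitive substring match done locally with str.casefold()."""
    folded = needle.casefold()
    return [value for value in values if folded in value.casefold()]
```

The client-side path trades bandwidth for correctness, since more objects have to be fetched before filtering, so it is probably only worth taking when the needle actually contains non-ASCII characters.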

Locale support would be nice (i.e. "istanbul" should match İstanbul in a Turkish locale), but is not required. This may be a very deep rabbit hole: one would like istanbul to match both İstanbul and Istanbul, but on the other hand there may be too many false positives if the matching is too liberal.
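
The rabbit hole can be seen directly in Python. The sketch below compares the default, locale-independent case folding with a hand-written Turkish-style fold; `turkish_fold` is a simplified illustration, not a complete locale implementation.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"    # ı, LATIN SMALL LETTER DOTLESS I


def default_equal(a: str, b: str) -> bool:
    """Locale-independent comparison using full Unicode case folding."""
    return a.casefold() == b.casefold()


def turkish_fold(text: str) -> str:
    """Fold case with Turkish rules: İ pairs with i, and I pairs with ı."""
    return (
        text.replace(DOTTED_CAPITAL_I, "i")
        .replace("I", DOTLESS_SMALL_I)
        .lower()
    )


def turkish_equal(a: str, b: str) -> bool:
    """Comparison under the simplified Turkish fold above."""
    return turkish_fold(a) == turkish_fold(b)


ISTANBUL_TR = DOTTED_CAPITAL_I + "stanbul"  # İstanbul
```

Under the default folding, istanbul matches Istanbul but not İstanbul (the dotted capital folds to i plus a combining dot). Under the Turkish fold it is the other way around: istanbul matches İstanbul but not Istanbul. Matching both at once needs a deliberately looser rule.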

A good test case could include English loan-words like crème brûlée and naïve, typical Scandinavian words like Smörgåsbord and Blåbærsyltetøy, some French and Turkish words, as well as Ukrainian text.
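
As a starting point, test data along these lines could be generated from a word list. This is only a sketch: the Ukrainian sample word and the helper names are my own picks, and Turkish words are left out on purpose because they need separate, locale-dependent expectations.

```python
import unicodedata
from typing import Callable

# Sample words taken from the description, plus one Ukrainian word chosen
# here purely for illustration. NFC normalisation keeps the comparison
# independent of how the accented characters were typed.
WORDS = [
    unicodedata.normalize("NFC", word)
    for word in [
        "crème brûlée",
        "naïve",
        "Smörgåsbord",
        "Blåbærsyltetøy",
        "Україна",
    ]
]

# Each pair is (original spelling, upper-cased spelling).
CASE_PAIRS = [(word, word.upper()) for word in WORDS]


def ascii_only_equal(a: str, b: str) -> bool:
    """Case-insensitive for a-z only, similar in spirit to i;ascii-casemap."""
    def fold(text: str) -> str:
        return "".join(ch.upper() if "a" <= ch <= "z" else ch for ch in text)
    return fold(a) == fold(b)


def unicode_equal(a: str, b: str) -> bool:
    """Case-insensitive using Python's full Unicode case folding."""
    return a.casefold() == b.casefold()


def mismatches(compare: Callable[[str, str], bool]) -> list[tuple[str, str]]:
    """Return the pairs that the given comparison fails to match."""
    return [(a, b) for a, b in CASE_PAIRS if not compare(a, b)]
```

A test against a real server would create one event per word and search for the upper-cased form; `mismatches(ascii_only_equal)` gives a feel for which searches are expected to fail on a server limited to ASCII case mapping.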

The i;unicode-casemap collation may not be sufficient to handle all languages; see the Istanbul example above.

There is an example file in the example directory that may need a brush-up as well.
