Provide a trie-based alternative to UnicodeSet #2220

hsivonen · 2022-07-20T13:00:52Z

The ICU4X composing normalizer uses a UnicodeSet for a fast-path pass-through check, while the ICU4C composing normalizer uses a code point trie lookup. ICU4C ends up being faster even after optimizing other aspects on the ICU4X side, including special-casing the lowest range of the set (the Latin range below the combining diacritics block).

For a known-fragmented compile-time-known set, we should provide an alternative to UnicodeSet that uses the structure of CodePointTrie, but instead of wasting 7 bits of each value byte, divides the length of the value array by 8 and stores 8 logical bits in each byte.
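To make the idea concrete, here is a minimal sketch of a bit-packed, two-stage lookup. It is not an ICU4X API: `BitTrieSet`, `from_predicate`, and the 64-code-point block size are hypothetical names and choices made for illustration, and the real CodePointTrie layout (fast and small variants, multi-stage index) is more involved than this. The sketch only shows the core trick: the index maps each block of code points to a deduplicated data block, and each data byte carries 8 membership bits instead of one value.

```rust
/// Hypothetical bit-packed two-stage set (illustration only, not an ICU4X type).
const BLOCK_BITS: u32 = 6; // assumed block size: 64 code points per block
const BLOCK_LEN: usize = 1 << BLOCK_BITS;
const BLOCK_BYTES: usize = BLOCK_LEN / 8; // 8 logical bits stored per byte

pub struct BitTrieSet {
    /// One entry per block of code points: the number of the data block to use.
    index: Vec<u16>,
    /// Deduplicated data blocks, `BLOCK_BYTES` bytes each.
    data: Vec<u8>,
}

impl BitTrieSet {
    /// Builds the set for code points `0..=max` from a membership predicate.
    pub fn from_predicate(max: u32, f: impl Fn(u32) -> bool) -> Self {
        let mut index = Vec::new();
        let mut data: Vec<u8> = Vec::new();
        let mut start = 0u32;
        while start <= max {
            // Pack the membership bits of this block into bytes.
            let mut block = [0u8; BLOCK_BYTES];
            for i in 0..BLOCK_LEN as u32 {
                let cp = start + i;
                if cp <= max && f(cp) {
                    block[(i >> 3) as usize] |= 1 << (i & 7);
                }
            }
            // Reuse an identical block if one was already emitted.
            let existing = data
                .chunks_exact(BLOCK_BYTES)
                .position(|c| c == &block[..]);
            let block_no = match existing {
                Some(n) => n,
                None => {
                    data.extend_from_slice(&block);
                    data.len() / BLOCK_BYTES - 1
                }
            };
            index.push(block_no as u16);
            start += BLOCK_LEN as u32;
        }
        Self { index, data }
    }

    /// Two array reads plus a shift and mask; out-of-range input is not in the set.
    pub fn contains(&self, cp: u32) -> bool {
        let block_no = match self.index.get((cp >> BLOCK_BITS) as usize) {
            Some(&n) => n as usize,
            None => return false,
        };
        let bit = cp as usize & (BLOCK_LEN - 1);
        let byte = self.data[block_no * BLOCK_BYTES + (bit >> 3)];
        (byte >> (bit & 7)) & 1 != 0
    }
}
```

With this layout, a set built as `BitTrieSet::from_predicate(0x10FFFF, |cp| ...)` answers `contains` without a binary search over ranges, which is the property the fast-path check needs. For a compile-time-known set, the two vectors would presumably be baked into static data rather than built at run time.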

hsivonen · 2022-07-20T16:50:36Z

For the normalizer, #2221 makes more sense.

hsivonen added A-performance Area: Performance (CPU, Memory) C-unicode Component: Props, sets, tries labels Jul 20, 2022

sffc added the T-enhancement Type: Nice-to-have but not required label Jul 30, 2022

sffc added backlog help wanted Issue needs an assignee labels Aug 11, 2022

sffc added this to the Backlog milestone Dec 22, 2022

sffc removed the backlog label Dec 22, 2022
