Provide a trie-based alternative to UnicodeSet #2220

Open
hsivonen opened this issue Jul 20, 2022 · 1 comment
Labels
A-performance Area: Performance (CPU, Memory) C-unicode Component: Props, sets, tries help wanted Issue needs an assignee T-enhancement Type: Nice-to-have but not required

Comments

@hsivonen
Member

The ICU4X composing normalizer uses a UnicodeSet for its fast-path pass-through check, while the ICU4C composing normalizer uses a code point trie lookup. ICU4C ends up being faster even after optimizing other aspects on the ICU4X side, including special-casing the lowest range of the set (the Latin range below the combining diacritics block).

For a set that is known at compile time and known to be fragmented, we should provide an alternative to UnicodeSet that uses the structure of CodePointTrie but, instead of wasting 7 bits of each value byte, divides the length of the value array by 8 and stores 8 logical bits in each byte.
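To make the idea concrete, here is a rough sketch of a bit-packed, two-stage lookup. It is not the actual CodePointTrie layout, which has more index stages plus fast and small variants. The names `BitTrieSet`, `from_predicate`, and `contains`, as well as the 64-code-point block size, are assumptions made up for illustration. In a real implementation the tables would be generated at build time rather than computed from a closure.

```rust
// Hypothetical sketch, not an ICU4X type: a two-stage table in which each
// data byte carries membership bits for 8 consecutive code points.

const BLOCK_BITS: u32 = 6; // assumed block size: 64 code points
const BLOCK_SIZE: usize = 1 << BLOCK_BITS;
const BLOCK_BYTES: usize = BLOCK_SIZE / 8; // 8 logical bits per byte
const MAX_CP: u32 = 0x10FFFF;

pub struct BitTrieSet {
    /// Maps a block number (code point >> BLOCK_BITS) to a slot in `data`.
    index: Vec<u16>,
    /// Deduplicated blocks, each BLOCK_BYTES long.
    data: Vec<u8>,
}

impl BitTrieSet {
    /// Builds the tables from a membership predicate. Identical blocks are
    /// stored once, which is where the trie-style compression comes from.
    pub fn from_predicate(f: impl Fn(u32) -> bool) -> Self {
        let mut index = Vec::new();
        let mut data: Vec<u8> = Vec::new();
        for block_start in (0..=MAX_CP).step_by(BLOCK_SIZE) {
            // Pack the block: bit (i & 7) of byte (i >> 3) is code point i.
            let mut block = [0u8; BLOCK_BYTES];
            for i in 0..BLOCK_SIZE as u32 {
                if f(block_start + i) {
                    block[(i >> 3) as usize] |= 1 << (i & 7);
                }
            }
            // Reuse an existing identical block if there is one.
            let existing = data
                .chunks_exact(BLOCK_BYTES)
                .position(|b| b == &block[..]);
            let slot = match existing {
                Some(slot) => slot,
                None => {
                    data.extend_from_slice(&block);
                    data.len() / BLOCK_BYTES - 1
                }
            };
            // At most 0x110000 / 64 = 17408 slots, so u16 is wide enough.
            index.push(slot as u16);
        }
        Self { index, data }
    }

    /// Membership check: one index load, one data load, one bit test.
    pub fn contains(&self, cp: u32) -> bool {
        if cp > MAX_CP {
            return false;
        }
        let slot = self.index[(cp >> BLOCK_BITS) as usize] as usize;
        let bit = (cp as usize) & (BLOCK_SIZE - 1);
        let byte = self.data[slot * BLOCK_BYTES + (bit >> 3)];
        (byte >> (bit & 7)) & 1 != 0
    }
}
```

Unlike a binary search over an inversion list, the lookup cost here does not depend on how fragmented the set is, and the data array is one eighth the size of a byte-per-code-point value array.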

@hsivonen hsivonen added A-performance Area: Performance (CPU, Memory) C-unicode Component: Props, sets, tries labels Jul 20, 2022
@hsivonen
Member Author

For the normalizer, #2221 makes more sense.

@sffc sffc added the T-enhancement Type: Nice-to-have but not required label Jul 30, 2022
@sffc sffc added backlog help wanted Issue needs an assignee labels Aug 11, 2022
@sffc sffc added this to the Backlog milestone Dec 22, 2022
@sffc sffc removed the backlog label Dec 22, 2022

2 participants