Character names #1397
First we need to decide if we want unames in ICU4X, and if so, we should consider the work of @js-choi. Putting on the backlog for now.
Note that the article and the proof of concept are still unfinished (even though they are public). I can try to prioritize finishing them once this topic gets close to leaving the backlog; just let me know here when that happens. But, even though they are still unfinished, it is already clear that there is a lot of low-hanging fruit to compress all Unicode names, while maintaining fast random bidirectional lookup, using succinct or near-succinct data structures.
a) Getting Unicode character names can be useful for outputting debug info or data files.
b) Unicode character names as input I find not very useful. For example, UTS 18 suggests
c) Character names data is annoyingly large. It sounds like you have better ideas for how to compress them than I had, but it's still a lot of data for marginal usefulness.
d) You can look at what I cooked up for ICU in 1999: https://github.com/unicode-org/icu/blob/main/tools/unicode/c/genprops/namespropsbuilder.cpp contains the builder code, and the file starts with a brief description of the data format. The data for Unicode 15 is 289 kB. (It used to look much better in Unicode 3.0 days...) Code point to name is pretty fast, but the other way is a slow search.
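For readers unfamiliar with the general idea behind token-based name compression, here is a minimal, hypothetical sketch in Python. It is not the actual ICU data format (see the file header in namespropsbuilder.cpp for that); it only illustrates the basic technique of replacing frequently repeated words in names with small token indices:

```python
from collections import Counter

def build_token_table(names, max_tokens=256):
    """Collect the most frequent words across all names as tokens (toy scheme)."""
    counts = Counter(word for name in names for word in name.split())
    return [word for word, _ in counts.most_common(max_tokens)]

def encode(name, tokens):
    """Encode a name as a list of token indices instead of raw characters."""
    return [tokens.index(word) for word in name.split()]

def decode(indices, tokens):
    """Reconstruct the original name from its token indices."""
    return " ".join(tokens[i] for i in indices)

names = [
    "LATIN SMALL LETTER A",
    "LATIN CAPITAL LETTER A",
    "GREEK SMALL LETTER ALPHA",
]
tokens = build_token_table(names)
for name in names:
    assert decode(encode(name, tokens), tokens) == name
```

Because words like "LATIN", "SMALL", and "LETTER" recur across tens of thousands of names, storing one small index per word rather than the word itself is where the bulk of the savings comes from; the real ICU format is considerably more refined than this sketch.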
@markusicu: I definitely agree that the data cost of names is annoyingly large. I, too, would not suggest seriously considering their inclusion in ICU before it is demonstrated that they can be compressed into a succinct data structure (efficient in both time and space). I'm still working on this.

With regard to the usefulness of character→name vs. name→character, my own experience comes from implementing many parsers whose grammars need to treat specific characters in special ways. (These parsers only sometimes use regular-expression syntax; often they are functional combinator parsers or the like, created from composed expression trees.) I have always found it irksome to embed code points directly in my code; I find them to be essentially magic numbers that hinder readability.

It's certainly true that some names are long. (The shortest name, "Ox", is two letters long, but the longest name, "Box Drawings Light Diagonal Upper Centre to Middle Right and Middle Left to Lower Centre", is 88 characters long.) But the average length of names (pooling the 35,205 explicitly defined names from the current UCD data files) is 25.78 characters (e.g., "Domino Tile Vertical-06-06"). This isn't that bad: good code often emphasizes self-descriptive variable/function names, and a 26-character name wouldn't be out of place in such code.

It's also true that comments can alleviate code points' magic-number inscrutability, but this is ironically at the cost of increased wordiness (now you have both a code-point hex and a longish name) and locality (the explanation of the code point is removed from its original location in the source).

(My ulterior motive is eventually to make a proposal to TC39 to add named character literals to the JavaScript language. But I would not dare to do so until I can allay the engine vendors' concerns about storage, memory, and retrieval time; hence this project.)
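For a sense of what named character literals look like in practice, Python already has them: the `\N{...}` string escape, backed by the `unicodedata` module's bidirectional lookup. This is roughly the precedent a TC39 proposal would be following:

```python
import unicodedata

# Python supports named character literals via the \N{...} escape:
bullet = "\N{BULLET}"
assert bullet == "\u2022"

# The unicodedata module provides both lookup directions:
assert unicodedata.name("\u2022") == "BULLET"      # character -> name
assert unicodedata.lookup("BULLET") == "\u2022"    # name -> character
```

Note that CPython pays for this with a names database compiled into the interpreter, which is exactly the storage cost being discussed in this thread.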
Funnily enough, I (and some other TC39 members I've spoken to) may be confident about the usefulness of name→character lookup, but we've been much less certain about the usefulness of character→name retrieval. The best use case that I could imagine for the latter was metaprogramming (e.g., JavaScript transpilers and code minifiers). But your point about debugging general text data for human display is well taken.

Your namespropsbuilder.cpp from 1999 is fascinating. The fact that its token-based compression was already able to get down to 289 kB for Unicode 15 is impressive. I think there's a lot of other low-hanging fruit to squeeze it much more (while improving name→character time performance). I don't expect my demo to be done soon, but I'll follow up when I make some significant progress. Cheers!
Has any decision been made regarding whether to include this functionality in icu4x? For context, it is currently provided by an old version of unic and by the third-party crate unicode_names2. As to whether this is useful: the functionality is also used in crates such as charset-normalizer-rs to query certain properties that are not exposed by the current character property database.
I think we want this, but someone has to have time to implement it |
By someone, you mean someone from within the organisation, right? On the other hand, if the relevant project requirements could be communicated, and if necessary relaxed (at least initially), I might at least try to contribute an early version.
Yes, we would accept something externally contributed and can provide feedback on how to make it happen. The tricky thing to figure out is what data structure to use here: @sffc do we have any thoughts on what we want to use? Are there ICU4C data structures we plan to wrap, or should we be using a varzerovec or something? |
A few years ago, I wrote an unfinished article exploring how to compress the set of Unicode names with efficient bidirectional random access, using succinct data structures such as IBiSRP+DAC-L-TT (see Brisaboa et al. (2019)). There has also been a lot of other recent research on succinct data structures that should be of interest. I sadly haven't had the bandwidth to finish the article and its JavaScript proof of concept, but I can definitely see a path to a compact data structure with efficient bidirectional random access. The progress I was making suggested that the data structure could approach the information-theoretic lower bound, somewhere around 110 kB. (I was already able to reduce the Unicode names, with efficient random access, down to 640 kB, and I had many more optimizations to do when I ran out of time. I hope to revisit this question sometime, because it was a fun exercise.)
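As an aside on where a figure like "information-theoretic lower bound" comes from: a crude zeroth-order estimate is the Shannon entropy of the concatenated names. The sketch below shows the arithmetic on a toy corpus; the real ~110 kB figure would come from a much more careful model over all ~35k names, so treat this purely as an illustration of the method:

```python
import math
from collections import Counter

def shannon_bits(text):
    """Total zeroth-order Shannon information content of a corpus, in bits:
    sum over symbols of -p(symbol) * log2(p(symbol)), times corpus length."""
    counts = Counter(text)
    total = len(text)
    entropy_per_char = -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )
    return entropy_per_char * total

# Toy stand-in for the concatenated Unicode names; a real estimate
# would pool every name from the UCD data files.
corpus = "\n".join([
    "LATIN SMALL LETTER A",
    "LATIN CAPITAL LETTER A",
    "GREEK SMALL LETTER ALPHA",
])
print(f"~{shannon_bits(corpus) / 8:.0f} bytes (zeroth-order estimate)")
```

A zeroth-order model ignores correlations between characters (and whole repeated words), so real compressors, and better entropy models, can do substantially better than it suggests.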
I'd be fine landing this as two standalone mappers, one for each direction. Name-to-character should be implementable using

Regarding your exploration of bidirectional data structures for this type of thing: this sounds like something that would be useful for ICU4X, so long as the metrics are where we want them to be. A constraint I've hit up against while trying to implement things like this is that we want case-insensitive lookup in one direction but want to get case-normalized results in the other direction. Is this a constraint you've considered?
Wearing my hat as ICU4X tech lead, I would rather make sure we land a proper ICU4X-conventional solution than an interim solution that needs work. We have a lot of experimental code in ICU4X and it creates a maintenance burden and sits for a long time until someone invests the time to bring it up to spec. Since the Rust ecosystem seems to already have this functionality in another crate, I'd rather wait to land it in ICU4X until we can land it properly. However, I want to emphasize that the ICU4X contribution process is about the cleanest it's ever been. I've seen contributors come up to speed in a matter of hours. We have tutorials and lots of guardrails in place. I think you may be overestimating the effort that would be required to implement this feature "the ICU4X way". |
Thanks for the reply. Answering this question: I didn’t run into any hard obstacle when it came to case sensitivity of names. This was my approach:
The UAX44-LM2 fuzzy folding was implemented in this JavaScript module, and it is tested in this test suite. I believe Python's
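For readers who haven't seen UAX44-LM2: the rule is to ignore case, whitespace, underscores, and all medial hyphens, except that the hyphen in the name HANGUL JUNGSEONG O-E (U+1180) must be kept so it stays distinct from HANGUL JUNGSEONG OE (U+116C). Here is a rough Python sketch of that folding (an approximation, not the tested JavaScript module referenced above):

```python
import re

def fold_name(name):
    """Sketch of UAX44-LM2 loose matching: ignore case, whitespace,
    underscores, and medial hyphens. Exception: keep the hyphen in
    HANGUL JUNGSEONG O-E (U+1180) so it stays distinct from U+116C."""
    if name.upper().replace(" ", "") == "HANGULJUNGSEONGO-E":
        return "hanguljungseongo-e"
    # Remove medial hyphens: hyphens directly between two name characters.
    # A hyphen after a space (as in "TIBETAN LETTER -A") is not medial.
    name = re.sub(r"(?<=[^ _-])-(?=[^ _-])", "", name)
    # Ignore case, spaces, and underscores.
    return name.replace(" ", "").replace("_", "").lower()

# Loose variants of one name fold to the same key:
assert fold_name("ZERO WIDTH SPACE") == fold_name("zero-width-space")
# Non-medial hyphens are significant (U+0F60 vs. U+0F68):
assert fold_name("TIBETAN LETTER -A") != fold_name("TIBETAN LETTER A")
```

The medial-hyphen subtleties are exactly why a loose-matching lookup table wants to store pre-folded keys rather than fold at query time against unfolded data.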
Ok, if you store the strings together, then it makes sense that you can do "on-the-fly" case folding of the names. I don't know how to make it work, though, with a trie-like data structure with one node per character. For example, if "BOOKS" and "BOOK_STORE" are both in the trie and we get the query "BOOKSTORE", then after we reach the K node we don't know whether to go to the S node or the _ node.
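The branching problem described above can be demonstrated with a toy character-per-node trie (a plain-dict sketch, not ZeroTrie itself): if the trie stores the original unfolded keys, a single left-to-right pass over a folded query dead-ends.

```python
# Toy character trie over the *original* (unfolded) keys.
def insert(trie, key, value):
    node = trie
    for ch in key:
        node = node.setdefault(ch, {})
    node["$"] = value  # "$" marks end-of-key and holds the value

def lookup(trie, key):
    node = trie
    for ch in key:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")

trie = {}
insert(trie, "BOOKS", 1)
insert(trie, "BOOK_STORE", 2)

# Exact lookups work:
assert lookup(trie, "BOOKS") == 1
# ...but the folded query fails: after "BOOK" the trie branches on
# "S" vs. "_", and a single greedy pass cannot know which branch the
# query's "S" belongs to.
assert lookup(trie, "BOOKSTORE") is None
```

Supporting folded queries over unfolded keys would require backtracking over such branches; storing pre-folded keys (as suggested in the next comment) avoids the problem for the forward direction.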
Just to clarify: By “case-insensitive lookup” and “case folding”, are you referring strictly to ignoring the case of letters in names? In any case, neither case insensitivity nor UAX44-LM2 should be a barrier to using ZeroTrie for fuzzy/loose name→character lookup. ZeroTrie only needs to store the “folded” versions of the names:
As you point out earlier, ZeroTrie alone is insufficient for character→name retrieval. To support character→name retrieval, we would have to either (1) combine ZeroTrie with another data structure that can efficiently reconstruct the original names (e.g., a mapping from characters to paths through the trie's nodes, along with data about any underscores/spaces/hyphens discarded from the trie's names) or (2) replace ZeroTrie with a bidirectional string dictionary like Brisaboa et al.'s IBiS.

An aside about trie-based compressed string dictionaries vs. IBiS: Many interesting new succinct bidirectional string dictionaries use compressed tries, e.g., Grossi and Ottaviano's path-decomposed tries (PDTs) and Boffa et al.'s CoCo-tries. However, all of these trie-based approaches are limited by how tries must reconstruct strings from the same data that they use to look up strings' values. As far as I know, they cannot support fuzzy/loose matching with UAX44-LM2.

Brisaboa et al.'s IBiS compressed string dictionaries do not use tries, and because of this they do support fuzzy/loose matching with UAX44-LM2: they can somewhat decouple how they look up strings from how they actually store the strings. An IBiS dictionary can store names' substrings so that they are binary searchable by their folded versions. This is valid as long as each entire name's folded version is identical to the concatenation of its folded substrings. The original names' substrings would be what the IBiS dictionary actually stores: e.g., the substrings in folded-version order together form the binary search tree. In this binary tree, looking up
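Option (1) above can be sketched in miniature: a folded-key lookup structure for name→character, paired with a separate store of the original, case-canonical names for character→name. This toy uses plain dicts where a real implementation would use ZeroTrie plus a compressed string store, and a simplified folding (ignoring the medial-hyphen rules):

```python
def fold(name):
    # Simplified loose folding; full UAX44-LM2 also handles medial hyphens.
    return name.replace(" ", "").replace("_", "").lower()

# Small sample of real name/code-point pairs from the UCD.
names = {
    0x2022: "BULLET",
    0x1F0A1: "PLAYING CARD ACE OF SPADES",
}

# Forward direction: folded name -> code point (ZeroTrie's role).
by_folded = {fold(name): cp for cp, name in names.items()}
# Backward direction: code point -> original name (the auxiliary store).
by_cp = dict(names)

# Loose lookup accepts any casing/spacing variant...
assert by_folded[fold("playing card ace of spades")] == 0x1F0A1
# ...while retrieval returns the case-canonical original.
assert by_cp[0x2022] == "BULLET"
```

The space cost of this two-structure approach is dominated by the backward store of original names, which is precisely what a bidirectional dictionary like IBiS tries to share with the forward index.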
Thanks for the detailed response! I was mainly thinking about another application of this in time zone names. "Foo/Bar" and "Foo/BAZ" are both valid entries in the ZeroTrie. If I store them case-folded, I can't retrieve the original case-canonical form, and if I store them as case-canonical, I can't do case-insensitive lookup, since "B-a" and "B-A" go to different nodes in the trie. I will read more about IBiS; thanks for the pointer!
Hey, I just spoke with @sffc at an Ecma TC39 meeting, and he referred me to here.
ICU4X will eventually need to support lookup of Unicode character names (the `Name` property of code points, as well as character name aliases and named character sequences). I've been working on trying to compress the collection of Unicode character names into a succinct data structure, occupying as little space as possible (my goal is 100–200 kB) while maintaining efficient bidirectional random access. To that end, in my spare time I've been working on an (unfinished) article describing the data structure, as well as an (unfinished) reference JavaScript implementation.
I don’t anticipate my work being done until several months from now at the earliest. But @sffc tells me that work on character names in ICU4X hasn’t started yet either. When ICU4X does reach the point when character names start to get implemented, I’d love to help out however I can. In the meantime, I will continue to work on the JavaScript library before porting it to Rust.