Reduce ICU4X's dependence on ICU4C data #4602
Comments
Some background here: The UCD is heavily pre-processed in the ICU4C data build into a form known as
One potential advantage to leveraging ICU4C for these larger property blobs is that it paves the way for us to potentially share data files for some of these structures between C and X.
So while I'm not opposed to heading in this direction, whoever takes this issue should research exactly the nature of the machinery we're using in ICU4C, study the impact on cross-compatible data files, and create more bite-sized milestones.
Other Rust users are already reading the UCD, so it can't be that hard?
I don't see what this has to do with runtime representation. Neither the current text files in icuexportdata nor the UCD text files are a runtime format.
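To make the "reading the UCD" point concrete, here is a minimal sketch of parsing one record in the semicolon-separated UnicodeData.txt format with only the Rust standard library. The names `UcdRecord` and `parse_unicode_data_line` are made up for this example and are not taken from ICU4X datagen or any existing crate. It only illustrates that the text files are a build-time input, not a runtime format.

```rust
/// A few fields from one `UnicodeData.txt` record.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct UcdRecord {
    code_point: u32,
    name: String,
    general_category: String,
    /// Canonical_Combining_Class (field 3).
    ccc: u8,
    /// `Some("<compat>")`-style tag for compatibility mappings,
    /// `None` for canonical mappings or no mapping at all.
    decomposition_tag: Option<String>,
    /// Code points of the single-step decomposition mapping (field 5).
    decomposition: Vec<u32>,
}

/// Parses one line of `UnicodeData.txt` (15 semicolon-separated fields).
/// Returns `None` if the line does not have that shape.
fn parse_unicode_data_line(line: &str) -> Option<UcdRecord> {
    let fields: Vec<&str> = line.trim_end().split(';').collect();
    if fields.len() < 15 {
        return None;
    }
    let code_point = u32::from_str_radix(fields[0], 16).ok()?;
    let ccc = fields[3].parse::<u8>().ok()?;

    let mut decomposition_tag = None;
    let mut decomposition = Vec::new();
    for token in fields[5].split_whitespace() {
        if token.starts_with('<') {
            // Compatibility formatting tag such as `<noBreak>`.
            decomposition_tag = Some(token.to_string());
        } else {
            decomposition.push(u32::from_str_radix(token, 16).ok()?);
        }
    }

    Some(UcdRecord {
        code_point,
        name: fields[1].to_string(),
        general_category: fields[2].to_string(),
        ccc,
        decomposition_tag,
        decomposition,
    })
}
```

This sketch ignores range records (the `<..., First>` / `<..., Last>` pairs) and every field past the decomposition mapping, so a real data builder would need more than this.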
Can you confirm whether collation data is CLDR-derived?
I would be in favor of this in the long run. I'm not sure how much work it is and if it's worth it.
The root collation is built separately from the tailorings. The root is built from DUCET with LDML root refinements applied. The tool that builds it is
Once the root has been built,
Of the types of data mentioned in this issue, building the collation data without ICU4C would be by far the largest effort. The second-largest effort would be much, much simpler, but still complicated. The second-largest effort would be building the UTS 46 data into the form of a special normalization.
Thanks. For normalization especially I would somewhat prefer to rely on ppucd or directly on UCD. The current situation is extremely suboptimal: the normalization properties are exported as a part of icuexportdata, which is ICU4C-using C++ code that is not particularly easy to understand. The group of people that needs to debug that code (ICU4X) is not the group of people that can easily understand it (ICU4C devs), and I've already had to spend a bunch of time fixing segfaults and other issues in it. Still, I'm not convinced that the code will be of equal complexity if maintained by us in ICU4X datagen: the ICU4C code is able to invoke the ICU4C normalizer, whereas we would not be able to invoke our own normalizer and may have to do some manual work here. I'm hoping we can maintain the same complexity (it has to exist somewhere) but I'm not fully clear on everything that code does to be sure. Collation is messier and I'm less sure if we should try to reduce that ICU4C dependency yet.
Is there an opportunity to use a
I don't think so, because the ICU4C normalizer will rely on ICU4C normalizer data (and that's the main "core algorithm" of consequence).
This arises from getting the data into the form that the ICU4X normalizer expects and, potentially, from a certain lack of polish. It doesn't arise from C++ or ICU4C.
I wrote that code, so while we may have a truck number problem, I don't think that analysis of who debugs and who understands accurately describes the situation. That code needs 3 things from ICU4C:
The third item would take the most effort to replicate from UCD data files without ICU4C. Overall, I think it would be ideal if ICU4X was self-hosted, but as a practical matter, I think we should put engineering effort into reaching ECMA-402 coverage of the ICU4X feature set instead of putting engineering effort into decoupling the data pipeline from ICU4C at this time. The current situation with Unicode 16 introducing characters with novel normalization behaviors delaying ICU4C's data update and that blocking ICU4X's data update makes the whole thing look scarier than it is in the usual case.
No. Wasm wouldn't solve the problem that the ICU4C normalizer didn't anticipate the novel normalization behaviors that are now blocking the normalization data update.
When I was previously fixing bugs here it did involve chasing down ICU4C APIs to understand their nuances. Like, my experience with this code is precisely that I needed to be an ICU4C expert to fix it. And there is very little reason for ICU4C devs to be looking at this code; it is almost always going to be ICU4X code.
The tricky thing here is not just that it blocks our update: it's that Unicode expects implementors to have trouble with this update, and having time to fix things is crucial.
Good news: the third item is not necessary to fix the problem we're facing. We can continue to do UTS 46 mappings via icuexportdata but move (2) over to datagen, since we only need the single-character recursive stuff (and some other things), which can be done directly from data. (@eggrobin helped with this observation)
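As a rough sketch of what "single-character recursive stuff ... directly from data" could look like, the following expands a code point by repeatedly applying single-step canonical decomposition mappings of the kind found in field 5 of UnicodeData.txt. This is one reading of the comment above, not the code that icuexportdata or ICU4X datagen actually uses. The names `Mappings`, `sample_mappings`, and `fully_decompose` are hypothetical, and the three hard-coded entries stand in for data that would really be read from the UCD.

```rust
use std::collections::BTreeMap;

/// Single-step canonical decomposition mappings, keyed by code point.
/// Hypothetical alias for illustration only.
type Mappings = BTreeMap<u32, Vec<u32>>;

/// A tiny hand-copied sample; a real builder would load these from the UCD.
fn sample_mappings() -> Mappings {
    let mut m = Mappings::new();
    m.insert(0x212B, vec![0x00C5]); // ANGSTROM SIGN
    m.insert(0x00C5, vec![0x0041, 0x030A]); // A WITH RING ABOVE
    m.insert(0x01FA, vec![0x00C5, 0x0301]); // A WITH RING ABOVE AND ACUTE
    m
}

/// Appends the recursive expansion of `c` to `out`.
/// Assumes the mappings are acyclic, so the recursion terminates.
fn push_decomposed(c: u32, mappings: &Mappings, out: &mut Vec<u32>) {
    match mappings.get(&c) {
        Some(parts) => {
            for &part in parts {
                push_decomposed(part, mappings, out);
            }
        }
        // No mapping: the code point decomposes to itself.
        None => out.push(c),
    }
}

/// Returns the recursive expansion of a single code point.
/// Does not handle Hangul syllables (algorithmic) or canonical reordering.
fn fully_decompose(c: u32, mappings: &Mappings) -> Vec<u32> {
    let mut out = Vec::new();
    push_decomposed(c, mappings, &mut out);
    out
}
```

With the sample table, `fully_decompose(0x212B, &sample_mappings())` walks U+212B to U+00C5 and then to U+0041 U+030A. The point is only that this step needs the mapping table and no normalizer implementation; the "some other things" mentioned above are not covered here.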
We don't need UTS 46 to do alpha testing on normalization, but the ICU4X normalizer data merges auxiliary tables for UTS 46 and the K normalizations, so for actual deployment, all the normalizations need to come from the same builder.
(@markusicu and @hsivonen deep dive on normalization data pipeline)
Conclusions:
The above bullet points can be actioned, but are subject to the normal ICU4X prioritization process.
LGTM: @sffc @eggrobin @markusicu @Manishearth @robertbastian
Furthermore, making ICU4X fully independent of ICU4C, and vice-versa, should be our long-term goal. Both projects should read directly from UCD or other shared sources, and those sources should ship data useful for clients.
LGTM: @robertbastian @Manishearth, @eggrobin (@sffc, @hsivonen, @markusicu in principle)
It would be nice to cut out the middle man and construct as much data as possible directly from "the source". The icuexportdata we currently use contains:
ucd_parse crate) and generate the data from it.
I think it's desirable for ICU4X to be as independent of ICU4C as possible, in order to identify and upstream any custom ICU4C behaviour.