Transliteration/Segmentation bindings for external implementations #1809

Open
nciric opened this issue Apr 19, 2022 · 4 comments
Labels
A-data Area: Data coverage or quality A-design Area: Architecture or design blocked A dependency must be resolved before this is actionable C-data-infra Component: provider, datagen, fallback, adapters C-segmentation Component: Segmentation

Comments

Contributor

nciric commented Apr 19, 2022

Multiple clients want to use the public ICU API for transliteration and segmentation because it is prevalent in the industry: users wouldn't have to migrate code, only build rules and dependencies. These clients would like to provide their own implementations for some language pairs.

Our implementation should meet these requirements:

  1. ICU4X does not depend on external implementations.
  2. ICU4X exposes an API for implementers to bind to (an abstract class comes to mind in Java/C++).
  3. An implementer can point the ICU4X build at the new dependency (or expect external linkage to happen at some point).

For example, there are teams that specialize in ML models for language-pair transliteration, especially Indic languages <-> Romanization, that do a much better job than rule-based or dictionary-based solutions. They would override some pairs with their solution but fall back to our general approach for others. Similar problems exist for segmentation, and maybe other APIs.
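
The override-with-fallback idea above can be sketched roughly as follows. This is only an illustration under stated assumptions: `Transliterate`, `RuleBased`, `ExternalModel`, and `Registry` are hypothetical names invented for this sketch, not ICU4X types, and the two engines are placeholders whose output exists only to make the routing visible.

```rust
use std::collections::HashMap;

/// Hypothetical interface an external implementer would bind to.
pub trait Transliterate {
    fn transliterate(&self, input: &str) -> String;
}

/// Stand-in for a built-in rule-based engine (placeholder behavior only).
pub struct RuleBased;

impl Transliterate for RuleBased {
    fn transliterate(&self, input: &str) -> String {
        input.to_uppercase()
    }
}

/// Stand-in for an externally supplied engine, e.g. one backed by an ML model.
pub struct ExternalModel;

impl Transliterate for ExternalModel {
    fn transliterate(&self, input: &str) -> String {
        format!("<model:{}>", input)
    }
}

/// Routes by (source, target) pair and falls back to the general engine.
pub struct Registry {
    overrides: HashMap<(String, String), Box<dyn Transliterate>>,
    fallback: Box<dyn Transliterate>,
}

impl Registry {
    pub fn new(fallback: Box<dyn Transliterate>) -> Self {
        Self { overrides: HashMap::new(), fallback }
    }

    /// Registers an external engine for one language pair.
    pub fn register(&mut self, source: &str, target: &str, engine: Box<dyn Transliterate>) {
        self.overrides
            .insert((source.to_owned(), target.to_owned()), engine);
    }

    /// Uses the override for this pair if one exists, else the fallback.
    pub fn transliterate(&self, source: &str, target: &str, input: &str) -> String {
        let key = (source.to_owned(), target.to_owned());
        self.overrides
            .get(&key)
            .unwrap_or(&self.fallback)
            .transliterate(input)
    }
}
```

Because the registry only sees `Box<dyn Transliterate>`, the core crate in this sketch never names the external implementation; the client supplies it at construction time.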

This is similar to the Budou/X issue #1803: how do we include it in ICU4X without depending on it or making it a special case?

@nciric nciric added the A-design Area: Architecture or design label Apr 19, 2022
@zbraniecki zbraniecki added C-data-infra Component: provider, datagen, fallback, adapters C-segmentation Component: Segmentation discuss-priority Discuss at the next ICU4X meeting A-data Area: Data coverage or quality labels Apr 19, 2022
@sffc
Member

sffc commented Apr 28, 2022

Discussion:

  • @sffc - We should focus on well-tested ICU4X engines, perhaps with features.
  • @nciric - This is for specific clients to override so that they can use their code with our API.
  • @echeran - In both Unicode Properties and MessageFormat, we're talking about making generic interfaces. Should we put interfaces everywhere? This theme keeps coming up.
  • @Manishearth - We often discuss, do we make this thing pluggable? We need to look at who is doing the plugging and what are they plugging into. Let's say we provide a trait for external impls. Are there places in our code where they are plugging in their objects? Because Segmenter is the highest-level API we would have. Are we allowing people to override something inside Segmenter, or override the whole Segmenter?
  • @sffc - If they're overriding the engine just for CJK, then that would be pluggable into ICU4X. But if we're overriding all of Segmenter, a trait would be better. Also, we have the data provider as a way to do overrides; we should avoid adding dozens of different ways to do overrides.
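
One way to picture the "override the engine just for CJK" option from the discussion is the sketch below. It is a toy under loud assumptions: `BreakEngine`, `SpaceEngine`, `PerCharEngine`, and this `Segmenter` are hypothetical stand-ins rather than the ICU4X API, `is_cjk` checks a single Unicode block, and every run boundary is treated as a break for simplicity.

```rust
/// Hypothetical engine trait: returns break offsets (byte indices) within `run`.
pub trait BreakEngine {
    fn breaks(&self, run: &str) -> Vec<usize>;
}

/// Stand-in default engine: break after each ASCII space.
pub struct SpaceEngine;

impl BreakEngine for SpaceEngine {
    fn breaks(&self, run: &str) -> Vec<usize> {
        run.char_indices()
            .filter(|(_, c)| *c == ' ')
            .map(|(i, _)| i + 1)
            .collect()
    }
}

/// Stand-in external CJK engine: break after every character.
pub struct PerCharEngine;

impl BreakEngine for PerCharEngine {
    fn breaks(&self, run: &str) -> Vec<usize> {
        run.char_indices().map(|(i, c)| i + c.len_utf8()).collect()
    }
}

/// Rough check covering only the CJK Unified Ideographs block.
fn is_cjk(c: char) -> bool {
    ('\u{4E00}'..='\u{9FFF}').contains(&c)
}

pub struct Segmenter {
    default_engine: Box<dyn BreakEngine>,
    cjk_engine: Option<Box<dyn BreakEngine>>,
}

impl Segmenter {
    pub fn new() -> Self {
        Self { default_engine: Box::new(SpaceEngine), cjk_engine: None }
    }

    /// Plugs an external engine in for CJK runs only.
    pub fn with_cjk_engine(mut self, engine: Box<dyn BreakEngine>) -> Self {
        self.cjk_engine = Some(engine);
        self
    }

    /// Splits `text` into CJK and non-CJK runs and dispatches each run.
    pub fn segment(&self, text: &str) -> Vec<usize> {
        let mut out = Vec::new();
        let mut start = 0;
        while start < text.len() {
            let rest = &text[start..];
            let cjk = rest.chars().next().map_or(false, is_cjk);
            // The run ends at the first character of the other kind.
            let len = rest
                .char_indices()
                .find(|(_, c)| is_cjk(*c) != cjk)
                .map_or(rest.len(), |(i, _)| i);
            let engine: &dyn BreakEngine = match (&self.cjk_engine, cjk) {
                (Some(e), true) => &**e,
                _ => &*self.default_engine,
            };
            out.extend(engine.breaks(&rest[..len]).into_iter().map(|b| start + b));
            start += len;
            out.push(start); // a run boundary is always a break in this sketch
        }
        out.dedup();
        out
    }
}
```

Overriding the whole `Segmenter` would instead mean putting a trait at the level of `segment` itself, which is the other option raised in the discussion.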

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Apr 28, 2022
@sffc sffc added this to the ICU4X 1.0 (Polish) milestone May 26, 2022
@sffc sffc self-assigned this May 26, 2022
@sffc
Member

sffc commented May 26, 2022

Action for 1.0 is to make sure we're not boxing ourselves into a corner with the currently proposed APIs.

@sffc
Member

sffc commented Jul 27, 2022

I think we are mostly future-proof here. WordBreakSegmenter has private fields, so we could potentially add more private fields in the future. Adding a generic parameter to WordBreakSegmenter would be a breaking change, but we could use trait objects in the interim, and add the generic parameter in 2.0 if required. I am therefore going to mark this issue as resolved for 1.0 purposes.
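
A minimal sketch of why private fields plus a trait object keep this non-breaking, assuming a toy type: `ToyWordBreakSegmenter`, `WordBreakHook`, and `BreakEverywhere` are hypothetical names for illustration and do not reflect the real `WordBreakSegmenter` internals.

```rust
/// Hypothetical extension point; not part of ICU4X.
pub trait WordBreakHook {
    fn is_break(&self, before: char, after: char) -> bool;
}

/// Example external hook: allow a break between any two characters.
pub struct BreakEverywhere;

impl WordBreakHook for BreakEverywhere {
    fn is_break(&self, _before: char, _after: char) -> bool {
        true
    }
}

pub struct ToyWordBreakSegmenter {
    // Private field: adding it later leaves the public type unchanged.
    hook: Option<Box<dyn WordBreakHook>>,
}

impl ToyWordBreakSegmenter {
    /// Original constructor; its signature stays the same.
    pub fn new() -> Self {
        Self { hook: None }
    }

    /// Constructor added later; purely additive for existing callers.
    pub fn with_hook(hook: Box<dyn WordBreakHook>) -> Self {
        Self { hook: Some(hook) }
    }

    /// Returns break offsets (byte indices), always ending with `text.len()`.
    pub fn breaks(&self, text: &str) -> Vec<usize> {
        let mut out = Vec::new();
        let mut prev: Option<char> = None;
        for (i, c) in text.char_indices() {
            if let Some(p) = prev {
                let is_break = match &self.hook {
                    Some(h) => h.is_break(p, c),
                    None => p == ' ', // toy built-in rule: break after a space
                };
                if is_break {
                    out.push(i);
                }
            }
            prev = Some(c);
        }
        out.push(text.len());
        out
    }
}
```

Callers that wrote `ToyWordBreakSegmenter::new()` keep compiling after `with_hook` is added, whereas changing the type to `ToyWordBreakSegmenter<E>` would alter the type that callers name in their own signatures.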

@sffc sffc modified the milestones: ICU4X 1.0 (Polish), ICU4X 2.0 Jul 27, 2022
@sffc
Member

sffc commented Mar 14, 2024

@FrankYFTang is working on this in ICU4C. We should coordinate with him at that time.

We can do this in a non-breaking way.

@sffc sffc modified the milestones: ICU4X 2.0, 1.x Priority ⟨P2⟩ Mar 14, 2024
@sffc sffc removed their assignment Mar 14, 2024
@sffc sffc added the blocked A dependency must be resolved before this is actionable label Mar 14, 2024
3 participants