Fine-grained data filtering #953

Open
sffc opened this issue Aug 12, 2021 · 5 comments
Assignees
Labels
A-tailoring (Area: User preferences, locale extensions, tailoring); C-data-infra (Component: provider, datagen, fallback, adapters); S-epic (Size: Major project, create smaller child issues); T-enhancement (Type: Nice-to-have but not required)

Comments

@sffc
Member

sffc commented Aug 12, 2021

data_phases.md (#498) discusses the three phases of information: compile time, construction time, and format time. Currently, static data slicing (#948) is only capable of filtering based on the ResourceKey (compile time information). However, @iainireland has noted that it may be useful to filter ResourceOptions or data structs as well.
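To make the distinction concrete, here is a minimal sketch of the two filtering levels. The `Entry` type and both functions are hypothetical stand-ins written for illustration; they are not the ICU4X provider API, and the key strings are made up.

```rust
/// Hypothetical, simplified model of one data entry (not an ICU4X type).
#[derive(Debug, Clone, PartialEq)]
pub struct Entry {
    /// Stands in for a ResourceKey; the strings used here are invented.
    pub key: &'static str,
    /// Stands in for ResourceOptions, reduced to a locale string.
    pub locale: &'static str,
}

/// Key-level filtering: keep only entries whose key is in the allowlist.
/// This is the kind of decision that can be made from compile-time information.
pub fn filter_by_key(entries: &[Entry], keys: &[&str]) -> Vec<Entry> {
    entries
        .iter()
        .filter(|e| keys.contains(&e.key))
        .cloned()
        .collect()
}

/// Options-level filtering: keep only entries whose locale passes `keep`.
/// This is finer-grained than key-level filtering.
pub fn filter_by_options<F>(entries: &[Entry], keep: F) -> Vec<Entry>
where
    F: Fn(&str) -> bool,
{
    entries
        .iter()
        .filter(|e| keep(e.locale))
        .cloned()
        .collect()
}
```

Filtering the contents of a data struct would be a third level below these two, operating on the payload rather than on the key or options.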

Some examples of potentially legitimate use cases:

  • Remove currency display names except for the currencies used in a particular country
  • Remove non-Gregorian calendar data or non-Latin numbering systems in a financial app
  • Remove display names for time zones outside North America

Such fine-grained filtering is very tricky, because you risk removing data that has legitimate i18n value. For example, one might attempt to remove right-to-left support from an app launching in Spain, only to discover that there are peoples in Spain who communicate in the Hebrew alphabet. Or, you might attempt to remove the Buddhist calendar from an app launching in Oklahoma, only to discover that Oklahoma City is home to 9 Buddhist temples.

I believe the best path forward for fine-grained filtering in ICU4X is to sandbox decisions into specific flags. We should start by identifying the use cases, and then add flags corresponding to those use cases that retain high-quality i18n behavior.
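As a rough sketch of what "sandboxing decisions into specific flags" could look like, each flag below covers exactly one of the use cases listed above. The `Record` and `FilterFlags` types, their fields, and the calendar identifiers are assumptions made for this example and do not describe an existing ICU4X interface.

```rust
/// Hypothetical kinds of data records a datagen tool might encounter.
#[derive(Debug, Clone, PartialEq)]
pub enum Record {
    CurrencyName { code: &'static str },
    CalendarData { calendar: &'static str },
}

/// Hypothetical use-case flags; each one sandboxes a single decision.
#[derive(Debug, Clone, Default)]
pub struct FilterFlags {
    /// If set, keep display names only for these currency codes.
    pub currencies: Option<Vec<&'static str>>,
    /// If true, drop data for calendars other than "gregory".
    pub gregorian_only: bool,
}

impl FilterFlags {
    /// Returns true if `record` should be kept under these flags.
    /// With no flags set (the default), everything is kept.
    pub fn keeps(&self, record: &Record) -> bool {
        match record {
            Record::CurrencyName { code } => match &self.currencies {
                Some(allowed) => allowed.contains(code),
                None => true,
            },
            Record::CalendarData { calendar } => {
                !self.gregorian_only || *calendar == "gregory"
            }
        }
    }
}
```

Because the default keeps all data, a client has to opt in to each removal explicitly, which keeps the risky decisions visible.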

This issue is to track the design and implementation of fine-grained data filtering in ICU4X.

@sffc added the labels help wanted (Issue needs an assignee), question (Unresolved questions; type unclear), C-data-infra (Component: provider, datagen, fallback, adapters), A-tailoring (Area: User preferences, locale extensions, tailoring), and S-epic (Size: Major project, create smaller child issues) on Aug 12, 2021
@sffc added the labels backlog and T-enhancement (Type: Nice-to-have but not required) and removed the label question (Unresolved questions; type unclear) on Aug 26, 2021
@sffc
Member Author

sffc commented Aug 26, 2021

Adding this to the backlog; when we have a client with a clear need for this, we should schedule work on this issue with that use case in mind.

@sffc sffc added this to the Backlog milestone Dec 22, 2022
@sffc sffc removed the backlog label Dec 22, 2022
@sffc sffc removed the help wanted Issue needs an assignee label Apr 15, 2023
@sffc sffc self-assigned this Apr 15, 2023
@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label Apr 15, 2023
@sffc
Member Author

sffc commented Apr 15, 2023

I'm pulling this up into the 1.3 milestone, since we are getting close to wanting to release components that need this type of data slicing.

The easiest and most robust solution is to do what we've now done with Collator, Japanese Eras, Segmenter, and Locale Expander, which is to create multiple keys: one for core data and one for extended data. This has the advantage that it works automatically with data slicing, with no additional infrastructure needed. One downside of this approach is that we need to define rigid boundaries between the core and extended data. Another downside is that if we need many levels of granularity, we risk hurting the performance of the resulting formatter, because each key needs to be checked separately for the required data. However, if we can establish a very good separation between core and extended, then this approach seems feasible.
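A minimal sketch of the core/extended split follows, assuming a toy provider modeled as nested hash maps. The key names `names/core` and `names/extended`, the `Provider` alias, and `lookup` are all hypothetical and chosen only for illustration; the returned count makes the per-key lookup cost visible.

```rust
use std::collections::HashMap;

/// Hypothetical key names; real ICU4X keys are named differently.
pub const CORE_KEY: &str = "names/core";
pub const EXTENDED_KEY: &str = "names/extended";

/// A toy provider: key -> (identifier -> display name).
pub type Provider = HashMap<&'static str, HashMap<&'static str, &'static str>>;

/// Looks up `id`, checking the core key first and then the extended key.
/// Returns the name (if found) and how many keys had to be checked.
pub fn lookup(provider: &Provider, id: &str) -> (Option<&'static str>, usize) {
    let mut checked = 0;
    for key in [CORE_KEY, EXTENDED_KEY] {
        checked += 1;
        if let Some(name) = provider.get(key).and_then(|table| table.get(id)) {
            return (Some(*name), checked);
        }
    }
    (None, checked)
}
```

Dropping the extended data is then just key-level slicing (removing `EXTENDED_KEY` from the provider), which is why no extra infrastructure is needed. The count also shows the cost side: every additional granularity level adds one more key to check on a miss.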

The two components that are coming up soon that need this are Currency Display Names and Locale Display Names.

One way to make coarse slices for currency names would be to give display names to all currencies that are used in a particular locale, with all other currencies falling back to their ISO code. It's a bit less clear how to make the coarse slices for locale display names (language, script, region, variants, and extensions).
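The currency idea can be sketched as follows, assuming the list of currencies used in a locale is supplied by the caller. Both function names and the sample data in the usage note are hypothetical.

```rust
use std::collections::HashMap;

/// Keeps display names only for the currency codes listed in `used`.
/// `used` stands in for "currencies used in a particular locale"; how that
/// list would be derived is left open here.
pub fn slice_names<'a>(
    all: &HashMap<&'a str, &'a str>,
    used: &[&str],
) -> HashMap<&'a str, &'a str> {
    let mut kept = HashMap::new();
    for (code, name) in all {
        if used.iter().any(|u| u == code) {
            kept.insert(*code, *name);
        }
    }
    kept
}

/// Returns the display name if it survived slicing, else the ISO code itself.
pub fn currency_display_name<'a>(
    names: &HashMap<&str, &'a str>,
    code: &'a str,
) -> &'a str {
    names.get(code).copied().unwrap_or(code)
}
```

For example, slicing a table of USD, CAD, and JPY down to `["USD", "CAD"]` would leave "JPY" displayed as the bare code `JPY`, which degrades gracefully rather than failing.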

Adding this to the discussion agenda.

@robertbastian robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 21, 2023
@sffc
Member Author

sffc commented Jun 22, 2023

Discuss with:

Optional:

@sffc added the label discuss-triaged (The stakeholders for this issue have been identified and it can be discussed out-of-band) on Jun 22, 2023
@sffc removed the labels discuss (Discuss at a future ICU4X-SC meeting), discuss-priority (Discuss at the next ICU4X meeting), and discuss-triaged (The stakeholders for this issue have been identified and it can be discussed out-of-band) on Jul 27, 2023
@sffc
Member Author

sffc commented Aug 24, 2023

Auxiliary keys are implemented, and there is a follow-up in #3907 to add filtering for them.

@sffc sffc closed this as completed Aug 24, 2023
@sffc
Member Author

sffc commented Aug 24, 2023

We'll track filtering here instead of in #3907.
