Fine-grained data filtering #953

sffc · 2021-08-12T00:40:17Z

data_phases.md (#498) discusses the three phases of information: compile time, construction time, and format time. Currently, static data slicing (#948) is only capable of filtering based on the ResourceKey (compile time information). However, @iainireland has noted that it may be useful to filter ResourceOptions or data structs as well.

Some examples of potentially legitimate use cases:

- Remove currency display names except for the currencies used in a particular country
- Remove non-Gregorian calendar data or non-Latin numbering systems in a financial app
- Remove display names for time zones outside North America

Such fine-grained filtering is very tricky, because you risk removing data that has legitimate i18n value. For example, one might attempt to remove right-to-left support from an app launching in Spain, only to discover that there are peoples in Spain who communicate in the Hebrew alphabet. Or, you might attempt to remove the Buddhist calendar from an app launching in Oklahoma, only to discover that Oklahoma City is home to 9 Buddhist temples.

I believe the best path forward for fine-grained filtering in ICU4X is to sandbox decisions into specific flags. We should start by identifying the use cases, and then add flags corresponding to those use cases that retain high-quality i18n behavior.

This issue is to track the design and implementation of fine-grained data filtering in ICU4X.

sffc · 2021-08-26T18:18:34Z

Adding this to the backlog; when we have a client with a clear need for this, we should schedule work on this issue with that use case in mind.

sffc · 2023-04-15T19:28:13Z

I'm pulling this up into the 1.3 milestone, since we are getting close to wanting to release components that need this type of data slicing.

The easiest and most robust solution is to do what we've now done with Collator, Japanese Eras, Segmenter, and Locale Expander, which is to create multiple keys: one for core data and one for extended data. This has the advantage that it works automatically with data slicing without any additional infrastructure needed. A downside of this approach is that we need to define rigid boundaries between the core and extended data. Another downside is that if we need many levels of granularity, we risk hurting the performance of the resulting formatter, because each key needs to be checked separately for the required data. But, if we can establish a very good separation between core and extended, then this approach seems feasible.
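To make the trade-off concrete, here is a minimal sketch of the core/extended split. It is not the real ICU4X API: `Names`, `DisplayNames`, and the plain string maps are hypothetical stand-ins for data keys and data structs. It shows why slicing works without extra infrastructure (the extended payload is simply absent) and where the extra per-key check comes from.

```rust
use std::collections::HashMap;

/// Illustrative payload type (an assumption): code -> display name.
type Names = HashMap<&'static str, &'static str>;

/// Hypothetical formatter holding one payload per data key.
struct DisplayNames {
    /// Payload from the "core" key; always loaded.
    core: Names,
    /// Payload from the "extended" key; `None` if that key was sliced out.
    extended: Option<Names>,
}

impl DisplayNames {
    /// Looks up a name, checking each key's payload in turn.
    /// Every additional level of granularity adds another check here.
    fn get(&self, code: &str) -> Option<&'static str> {
        self.core.get(code).copied().or_else(|| {
            self.extended
                .as_ref()
                .and_then(|ext| ext.get(code).copied())
        })
    }
}
```

With the extended payload present, a lookup that misses in `core` falls through to `extended`; with it sliced out, the same lookup returns `None` and the caller decides how to fall back.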

The two components that are coming up soon that need this are Currency Display Names and Locale Display Names.

One way to make coarse slices for currency names would be to give display names to all currencies that are used in a particular locale, with all others falling back to the ISO code. It's a bit less clear how to make the coarse slices for locale display names (language, script, region, variants, and extensions).
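A rough sketch of the currency slicing idea, under stated assumptions: `slice_currency_names` and `display_name` are hypothetical helpers, and the set of currencies "used in a locale" is taken as a given input rather than derived from any real data source.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Keeps display names only for the currencies in `used_in_locale`.
/// Both inputs are illustrative; real data would come from CLDR.
fn slice_currency_names<'a>(
    all_names: &BTreeMap<&'a str, &'a str>,
    used_in_locale: &BTreeSet<&'a str>,
) -> BTreeMap<&'a str, &'a str> {
    all_names
        .iter()
        .filter(|(code, _)| used_in_locale.contains(*code))
        .map(|(code, name)| (*code, *name))
        .collect()
}

/// Returns the display name if it survived slicing,
/// otherwise falls back to the ISO code itself.
fn display_name<'a>(sliced: &BTreeMap<&'a str, &'a str>, iso_code: &'a str) -> &'a str {
    sliced.get(iso_code).copied().unwrap_or(iso_code)
}
```

For example, slicing a table of USD, EUR, and JPY down to a locale that only uses USD leaves "US Dollar" available, while a request for JPY yields the bare code "JPY".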

Adding this to the discussion agenda.

sffc · 2023-06-22T17:48:45Z

Discuss with:

Optional:

@younies

sffc · 2023-08-24T17:24:43Z

Auxiliary keys are implemented, and there is a follow-up in #3907 to add filtering for them.

sffc · 2023-08-24T17:25:52Z

We'll track filtering here instead of in #3907

sffc added the help wanted (Issue needs an assignee), question (Unresolved questions; type unclear), C-data-infra (Component: provider, datagen, fallback, adapters), A-tailoring (Area: User preferences, locale extensions, tailoring), and S-epic (Size: Major project (create smaller child issues)) labels Aug 12, 2021

sffc mentioned this issue Aug 12, 2021

Add draft design doc involving phases of data provider information #498

Merged

sffc added the backlog and T-enhancement (Type: Nice-to-have but not required) labels and removed the question (Unresolved questions; type unclear) label Aug 26, 2021

sffc added this to the Backlog milestone Dec 22, 2022

sffc removed the backlog label Dec 22, 2022

sffc modified the milestones: Backlog ⟨P4⟩, 1.3 Blocking ⟨P1⟩ Apr 15, 2023

sffc removed the help wanted Issue needs an assignee label Apr 15, 2023

sffc self-assigned this Apr 15, 2023

sffc added the discuss Discuss at a future ICU4X-SC meeting label Apr 15, 2023

robertbastian added the discuss-priority Discuss at the next ICU4X meeting label Jun 21, 2023

sffc added the discuss-triaged The stakeholders for this issue have been identified and it can be discussed out-of-band label Jun 22, 2023

sffc removed the discuss (Discuss at a future ICU4X-SC meeting), discuss-priority (Discuss at the next ICU4X meeting), and discuss-triaged (The stakeholders for this issue have been identified and it can be discussed out-of-band) labels Jul 27, 2023

sffc closed this as completed Aug 24, 2023

sffc reopened this Aug 24, 2023

sffc modified the milestones: 1.3 Blocking ⟨P1⟩, 1.x Priority ⟨P2⟩ Aug 24, 2023

sffc mentioned this issue Aug 24, 2023

Datagen filtering for auxiliary keys #3907

Closed
