Improve handling of overlap patterns in semantic datetime #5387

Open
sffc opened this issue Aug 16, 2024 · 0 comments
Labels
C-datetime Component: datetime, calendars, time zones

sffc commented Aug 16, 2024

Definition: an "overlap pattern" refers to a pattern that has fields from multiple categories (date, time, and time zone). For example, a pattern with Weekday and Hour is an overlap pattern, and a pattern with Hour and Zone is an overlap pattern.
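As a rough illustration of this definition, the sketch below classifies the field symbols of a skeleton into categories and flags the skeleton as an overlap when more than one category is present. The `FIELD_CATEGORY` table and both function names are hypothetical and cover only a few symbols; this is not how ICU4X models field sets.

```python
# Hypothetical, partial mapping from CLDR field symbols to categories.
FIELD_CATEGORY = {
    "G": "date", "y": "date", "M": "date", "d": "date", "E": "date",
    "h": "time", "H": "time", "m": "time", "s": "time", "a": "time",
    "z": "zone", "v": "zone", "O": "zone",
}


def categories(skeleton):
    """Return the set of categories touched by the fields in a skeleton."""
    return {FIELD_CATEGORY[symbol] for symbol in skeleton}


def is_overlap(skeleton):
    """A skeleton is an overlap when its fields span more than one category."""
    return len(categories(skeleton)) > 1


print(is_overlap("Ehm"))  # Weekday + Hour + Minute
print(is_overlap("Hmv"))  # Hour + Minute + Zone
print(is_overlap("yMd"))  # date fields only
```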

Overlap patterns require a little bit of special code that I added in #5356. Essentially we just need to detect that the field set corresponds to a known overlap pattern and then load the overlap pattern instead of the standard date/time/zone patterns with the glue patterns.
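A minimal sketch of that selection logic, assuming patterns are keyed by skeleton string and that the glue pattern uses the CLDR convention of `{1}` for the date and `{0}` for the time. The function name and the sample date and glue values are made up for illustration; the real code in #5356 works on field sets and data structs rather than strings.

```python
def select_pattern(skeleton, overlap_patterns, date_pattern, time_pattern, glue):
    """Prefer a known overlap pattern; otherwise glue date and time together."""
    overlap = overlap_patterns.get(skeleton)
    if overlap is not None:
        return overlap
    # Glue fallback: {1} is the date part, {0} is the time part.
    return glue.replace("{1}", date_pattern).replace("{0}", time_pattern)


# Overlap entry taken from the "de" data quoted in this issue.
overlaps = {"EHm": "E, HH:mm"}

# Known overlap: the overlap pattern wins over the glue.
print(select_pattern("EHm", overlaps, "E", "HH:mm", "{1}, {0}"))
# Unknown combination: falls back to glue (illustrative date pattern and glue).
print(select_pattern("yMdHm", overlaps, "dd.MM.y", "HH:mm", "{1}, {0}"))
```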

However, there are a few issues to be resolved.

First, overlap patterns contribute fairly substantially to code size. As noted in #1317 (comment), there are 16 field sets corresponding to overlap patterns encoded in CLDR. A more efficient representation would be the one I implemented in #5357. However, I cannot implement that representation cleanly on top of current CLDR due to messiness in the data.

Second, it's not clear how widely used the overlap patterns are, and because of this, the data is not very high quality. For example, consider the following CLDR data in three locales:

Locale "no":

                "Ehm": "E h:mm a",
                "EHm": "E 'kl'. HH:mm",
                "Ehms": "E h:mm:ss a",
                "EHms": "E 'kl'. HH:mm:ss",

Locale "fr-CM":

                "Ehm": "E h:mm",
                "EHm": "E HH:mm",
                "Ehms": "E h:mm:ss",
                "EHms": "E HH:mm:ss",

Locale "de":

                "Ehm": "E h:mm a",
                "EHm": "E, HH:mm",
                "Ehms": "E, h:mm:ss a",
                "EHms": "E, HH:mm:ss",

All three of these locales have peculiarities in their data, and native speakers I consulted confirmed that they look suspicious:

  1. Norwegian uses 'kl.' as the joiner in the 24-hour patterns but not in the 12-hour patterns
  2. Cameroon French is missing the day period in its 12-hour patterns
  3. German has a comma in 3 out of the 4 patterns (it is missing from "Ehm")
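These three observations can be reproduced mechanically. The sketch below is only a quick audit over the data quoted above, not an ICU4X or CLDR tool. It assumes that the "joiner" is whatever sits between `E` and the first hour field, and the helper names are hypothetical.

```python
import re

# Patterns copied verbatim from the three locales quoted above.
DATA = {
    "no": {"Ehm": "E h:mm a", "EHm": "E 'kl'. HH:mm",
           "Ehms": "E h:mm:ss a", "EHms": "E 'kl'. HH:mm:ss"},
    "fr-CM": {"Ehm": "E h:mm", "EHm": "E HH:mm",
              "Ehms": "E h:mm:ss", "EHms": "E HH:mm:ss"},
    "de": {"Ehm": "E h:mm a", "EHm": "E, HH:mm",
           "Ehms": "E, h:mm:ss a", "EHms": "E, HH:mm:ss"},
}


def joiner(pattern):
    """Text between the weekday field and the first hour field."""
    match = re.match(r"E(.*?)[hH]", pattern)
    return match.group(1) if match else None


def has_day_period(pattern):
    """True if the pattern has an 'a' field outside quoted literals."""
    unquoted = re.sub(r"'[^']*'", "", pattern)
    return "a" in unquoted


def audit(patterns):
    issues = []
    if len({joiner(p) for p in patterns.values()}) > 1:
        issues.append("inconsistent joiner")
    # Skeletons with a lowercase 'h' are the 12-hour ones.
    if any("h" in skel and not has_day_period(p) for skel, p in patterns.items()):
        issues.append("12-hour pattern without day period")
    return issues


for locale, patterns in DATA.items():
    print(locale, audit(patterns))
```

On this data, "no" and "de" are flagged for an inconsistent joiner and "fr-CM" for a missing day period.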

If we could clean up the data somehow, I believe that #5357 would be feasible.

Third, the output would likely be higher quality if the standalone time patterns were substituted in, instead of using the time representation embedded in the overlap pattern. The standalone time patterns get more scrutiny and are likely better in quality.

For example, in most of the locales above, 12-hour time is rarely used, so the translators probably did not put as much thought into the 12-hour time patterns as they did into the 24-hour time patterns. It does not make sense for ICU4X to go out of its way and increase data size for everyone in order to support patterns that are hardly ever used and are not high in quality.
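One way to picture the substitution idea: keep only the weekday-plus-joiner prefix of a single overlap pattern and append the standalone time pattern to it. The sketch below is purely hypothetical, including the function names and the standalone time pattern `h:mm a`. It is not the representation from #5357, and it would break on a quoted literal containing `h` or `H`.

```python
import re


def weekday_prefix(overlap_pattern):
    """Everything from the weekday field up to, not including, the hour field."""
    match = re.match(r"(E.*?)(?=[hH])", overlap_pattern)
    if match is None:
        raise ValueError("not a weekday+time pattern: %r" % overlap_pattern)
    return match.group(1)


def with_standalone_time(overlap_pattern, standalone_time):
    """Reuse the overlap pattern's joiner, but take the time from elsewhere."""
    return weekday_prefix(overlap_pattern) + standalone_time


# "E, HH:mm" is the German "EHm" entry quoted above; "h:mm a" is an
# assumed standalone 12-hour time pattern, used only for illustration.
print(with_standalone_time("E, HH:mm", "h:mm a"))
```

With this approach the joiner comes from one pattern per locale, so an inconsistency like the missing comma in the German "Ehm" entry could not occur.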

The current implementation also does not handle non-default hour cycles. I fall back on glue patterns for those, but ideally we would use the overlap patterns CLDR gives us.
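A sketch of what the ideal lookup could look like, assuming overlap patterns are keyed by skeleton and that the requested hour cycle selects the hour symbol (`K`, `h`, `H`, `k` for h11, h12, h23, h24). The function names are hypothetical; a `None` result stands for "fall back to the glue pattern".

```python
# CLDR hour field symbols for each hour cycle.
HOUR_SYMBOL = {"h11": "K", "h12": "h", "h23": "H", "h24": "k"}


def skeleton_for(hour_cycle, with_seconds=False):
    """Build a weekday+time skeleton for the requested hour cycle."""
    return "E" + HOUR_SYMBOL[hour_cycle] + "m" + ("s" if with_seconds else "")


def lookup_overlap(overlap_patterns, hour_cycle, with_seconds=False):
    """Return the overlap pattern for the hour cycle, or None to use glue."""
    return overlap_patterns.get(skeleton_for(hour_cycle, with_seconds))


# German data as quoted above.
de = {"Ehm": "E h:mm a", "EHm": "E, HH:mm",
      "Ehms": "E, h:mm:ss a", "EHms": "E, HH:mm:ss"}

print(lookup_overlap(de, "h12"))        # non-default cycle, served from overlap data
print(lookup_overlap(de, "h23", True))  # 24-hour pattern with seconds
print(lookup_overlap(de, "h11"))        # no such entry, caller falls back to glue
```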

I would like to not block 2.0 on this.
