Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing TZ exemplar city generation #3204

Merged
merged 10 commits into from
Mar 21, 2023
Merged

Conversation

robertbastian
Copy link
Member

#3181

CLDR 43 removes trivial display names entries, so iterating through display names will no longer work. We instead iterate through the BCP47/Tzid map.

The way this code deals with aliases seems to be unintentional, as it takes the alphabetically last value. I've preserved this behaviour, but it's something we should figure out.

sffc
sffc previously approved these changes Mar 16, 2023
@robertbastian
Copy link
Member Author

@nordzilla could you shed some light on the alias logic?

sffc
sffc previously approved these changes Mar 16, 2023
@robertbastian
Copy link
Member Author

I've rewritten the logic. It now uses the first TZDB ID in the alias field that has an exemplar city (previously I think it was implicitly the last), and falls back to using the last part of the TZDB ID (i.e. America/Los_Angeles -> Los Angeles) if there are no display names.

This leads to some data diffs, i.e. kipho changes from Kanton to Enderbury, because its aliases are Pacific/Enderbury Pacific/Kanton, Montreal is added through the fallback (it doesn't seem to have display names?), and English also gains Currie, Santa Isabel, and Honolulu through the fallback (other languages already have them, no idea why these aren't included in English in CLDR 42).

@robertbastian robertbastian requested a review from sffc March 17, 2023 11:56
@Manishearth Manishearth removed their request for review March 17, 2023 16:53
@robertbastian robertbastian requested a review from sffc March 17, 2023 17:46
@@ -209,7 +210,7 @@
"kgfru": "Bishkek",
"khpnh": "Phnom Penh",
"kicxi": "Kiritimati",
"kipho": "Kanton",
"kipho": "Enderbury",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kanton becoming Enderbury is wrong, but expected, as Pacific/Enderbury is listed before Pacific/Kanton. This problem is tracked in https://unicode-org.atlassian.net/browse/CLDR-14453. en works around it by having removed the display name for Pacific/Enderbury, so that Pacific/Kanton is used.

provider/datagen/src/transform/cldr/time_zones/convert.rs Outdated Show resolved Hide resolved
.map(|aliases| (bcp47, aliases))
})
.filter_map(|(bcp47, aliases)| {
aliases
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: There are 2 definitions of the canonical IANA alias for each time zone; one is the CLDR alias, which is more historical and always stable, and the other is the IANA alias, which tends to be more modern but is subject to change. The order of the aliases in the bcp47 file does not seem to correspond to whether it is the CLDR alias or the IANA alias. I know there are open issues about this, as well as an ECMA-402 proposal in the works to settle on whose alias is the one the industry should use.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertbastian robertbastian requested a review from sffc March 20, 2023 09:06
@robertbastian
Copy link
Member Author

I'm not sure if we should include the last commit, it might lead to problems in the future. We should fix the data instead.

@robertbastian
Copy link
Member Author

It already doesn't work with CLDR43 as that doesn't allow us to distinguish between explicitly set and key-derived values.

sffc
sffc previously approved these changes Mar 21, 2023
@sffc
Copy link
Member

sffc commented Mar 21, 2023

I'm not sure if we should include the last commit, it might lead to problems in the future. We should fix the data instead.

Until CLDR gets their act together, any of the following are reasonable behaviors:

  1. Get the name for the first alias; err if it cannot be found (special case montreal)
  2. Get the name for the first alias; don't add anything to the icu4x data struct if it cannot be found
  3. Get the name for the first alias that has a name

I think the PR is currently implementing 1, right?

@robertbastian
Copy link
Member Author

This does:

  • Get the name of the first alias
  • Get the city for the alias, if found return
  • If the alias starts with Etc, skip
  • If the alias is a single segment, skip
  • Use the last segment of the alias

The last case currently only happens for the three English names that get added here, but will happen a lot in CLDR 43

@robertbastian robertbastian merged commit 0289180 into unicode-org:main Mar 21, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants