Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design architecture around low-cost locale parsing and storage #958

Closed
sffc opened this issue Aug 17, 2021 · 3 comments
Closed

Design architecture around low-cost locale parsing and storage #958

sffc opened this issue Aug 17, 2021 · 3 comments
Assignees
Labels
C-locale Component: Locale identifiers, BCP47 R-obsolete Resolution: This issue is no longer relevant S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality
Milestone

Comments

@sffc
Copy link
Member

sffc commented Aug 17, 2021

Context: The Locale::from_bytes() function contributes a fair bit of bloat to the ICU4X binary. I am brainstorming ways to make it lighter.

In many cases, we encounter valid BCP-47 strings, and we need to parse them into a Locale.

We parse the language, script, and region subtags into TinyStr fields. However, for variants and Unicode extensions, we almost always need to allocate memory. Why? We could instead just point to substrings of the input BCP-47.

The Locale struct would be something along the lines of:

struct ZeroCopyLocale<'data> {
    language: TinyStr4,
    script: Option<TinyStr4>,
    region: Option<TinyStr4>,
    extras: Option<&'data str>, // or Cow<'data, str>
}

// Example: en-US-posix-u-ca-hebrew
ZeroCopyLocale {
    language: "en",
    script: None,
    region: Some("US"),
    extras: Some("posix-u-ca-hebrew"),
}

Meanwhile, the function to query the variants or extension subtags would essentially perform "on-the-fly" parsing.

Advantages of this model:

  1. Extremely fast and small code to parse BCP-47 (zero memory allocations)
  2. Does not incur the cost of parsing variants and extensions when they aren't needed
  3. Zero-copy compatibility with Yoke/DataPayload
  4. No need to distinguish LangID from Locale

Disadvantages:

  1. Adds a lifetime that wasn't there before (not self-contained)
  2. Lack of mutability / limited use cases

Should we consider doing something like UnicodeSet/UnicodeSetBuilder here? Perhaps we could use the existing code as a LocaleBuilder of sorts, and ZeroCopyLocale could be the lightweight version for runtime use.

CC @zbraniecki @Manishearth

@sffc sffc added C-locale Component: Locale identifiers, BCP47 discuss Discuss at a future ICU4X-SC meeting labels Aug 17, 2021
@sffc
Copy link
Member Author

sffc commented Aug 26, 2021

  • @zbraniecki - I'd like to avoid 7 locale classes for 7 different use cases. Both use cases are useful; most use cases won't care about serializing extensions. I think we should be looking more at how to reduce what we store after we finish parsing, since the serialization is rarely used.
  • @Manishearth - Adding lifetimes to locales will be annoying. Ideally locales are small and easy to move around. One proposal is that we don't parse the whole locale; I think that's a good idea, but it doesn't need to be coupled with zero-copy.
  • @zbraniecki - Are you worried about objects that have invalid data stored in ICU4X?
  • @sffc - I envision ZeroCopyLocale as more of a reader of BCP47, not intended for long-term storage. I see it as a useful ephemeral locale parser.
  • @mihnita - We shouldn't randomly drop things, because it's very error-prone.
  • @sffc - +1
  • @nciric - Fuchsia and Dart have their own locale classes. Fuchsia makes heavy use of locale extensions. The one in Dart in particular is very small and fast. Also, even if we do lazy parsing of extensions, we still need to carry the code, right?
  • @mihnita - In Android, we are looking at storing preferences in the locale identifier (-u-hc, etc). And since locale parsing can be slow, caching things can help.
  • @zbraniecki - I see two extremes. Either we pick one locale that does what everyone needs, or we could introduce a special crate with 50+ versions to cover different use cases. But one issue is that you might have code written with one locale type, and then the conditions change and you need to change all your code to use a different locale type.
  • @dminor - We already have a separation between Locale and LocaleCanonicalizer. We have some space where we can shift around some of the validation and canonicalization.
  • @zbraniecki - There's a tremendous difference between resolving aliases and making a semantically valid locale. I want to avoid us storing semantically invalid locales.
  • @sffc - I think we should lean more heavily on BCP-47 strings for storage. We can have small no-alloc functions to validate the BCP-47 or to pull a specific subtag out of it.
  • @nciric - Are we worried about the code size, or the memory size? And in JS we have functions that could accept either a Locale or a String as input.
  • @zbraniecki - Validation versus portability: if we use lenient string storage... maybe that should be a trait on a string. Because there are some cases where you just care about getting the region, and that could be a trait on the string. You could end up with a class that implements those traits on top of a string.
  • @Manishearth - Locales are zero-alloc and non-borrowing in 99% of cases, unless there is an extension. With the string-based model, we need to either always alloc or always borrow.
  • @sffc - I like the direction @zbraniecki suggested.
  • @mihnita - I like that direction too. Two more points. (1) I might think I don't need something, but maybe some other code I used needs it, which is why you want to generally carry everything with you. (2) In Java, you can create an invalid locale, and carry semantically invalid locales with you, and it's fine. I can call functions to normalize it, but it's fine to move it around. You can move the price around to different areas.

@sffc sffc added S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality and removed discuss Discuss at a future ICU4X-SC meeting labels Aug 26, 2021
@sffc sffc changed the title Why not make a zero-copy Locale? Design architecture around low-cost locale parsing and storage Aug 26, 2021
@sffc sffc added this to the ICU4X 0.5 milestone Aug 26, 2021
@sffc sffc added the help wanted Issue needs an assignee label Aug 26, 2021
@sffc sffc modified the milestones: ICU4X 0.5, ICU4X 0.6 Jan 27, 2022
@sffc sffc removed the help wanted Issue needs an assignee label Jan 27, 2022
@sffc sffc self-assigned this Jan 27, 2022
@sffc
Copy link
Member Author

sffc commented Jan 27, 2022

Similar to #1034; possible duplicate

@sffc
Copy link
Member Author

sffc commented May 25, 2022

This is resolved because we now have a well-documented solution for how to do zero-copy storage of locales.

https://unicode-org.github.io/icu4x-docs/doc/icu_locid/zerovec/index.html

@sffc sffc closed this as completed May 25, 2022
@sffc sffc added the R-obsolete Resolution: This issue is no longer relevant label May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-locale Component: Locale identifiers, BCP47 R-obsolete Resolution: This issue is no longer relevant S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality
Projects
None yet
Development

No branches or pull requests

1 participant