Design architecture around low-cost locale parsing and storage #958

sffc · 2021-08-17T09:12:12Z

Context: The Locale::from_bytes() function contributes a fair bit of bloat to the ICU4X binary. I am brainstorming ways to make it lighter.

In many cases, we encounter valid BCP-47 strings, and we need to parse them into a Locale.

We parse the language, script, and region subtags into TinyStr fields. However, for variants and Unicode extensions, we almost always need to allocate memory. Why? We could instead just point to substrings of the input BCP-47.

The Locale struct would be something along the lines of:

struct ZeroCopyLocale<'data> {
    language: TinyStr4,
    script: Option<TinyStr4>,
    region: Option<TinyStr4>,
    extras: Option<&'data str>, // or Cow<'data, str>
}

// Example: en-US-posix-u-ca-hebrew
ZeroCopyLocale {
    language: "en",
    script: None,
    region: Some("US"),
    extras: Some("posix-u-ca-hebrew"),
}

Meanwhile, the function to query the variants or extension subtags would essentially perform "on-the-fly" parsing.

Advantages of this model:

Extremely fast and small code to parse BCP-47 (zero memory allocations)
Does not incur the cost of parsing variants and extensions when they aren't needed
Zero-copy compatibility with Yoke/DataPayload
No need to distinguish LangID from Locale

Disadvantages:

Adds a lifetime that wasn't there before (not self-contained)
Lack of mutability / limited use cases

Should we consider doing something like UnicodeSet/UnicodeSetBuilder here? Perhaps we could use the existing code as a LocaleBuilder of sorts, and ZeroCopyLocale could be the lightweight version for runtime use.

CC @zbraniecki @Manishearth

sffc · 2021-08-26T18:54:00Z

@zbraniecki - I'd like to avoid 7 locale classes for 7 different use cases. Both use cases are useful; most use cases won't care about serializing extensions. I think we should be looking more at how to reduce what we store after we finish parsing, since the serialization is rarely used.
@Manishearth - Adding lifetimes to locales will be annoying. Ideally locales are small and easy to move around. One proposal is that we don't parse the whole locale; I think that's a good idea, but it doesn't need to be coupled with zero-copy.
@zbraniecki - Are you worried about objects that have invalid data stored in ICU4X?
@sffc - I envision ZeroCopyLocale as more of a reader of BCP47, not intended for long-term storage. I see it as a useful ephemeral locale parser.
@mihnita - We shouldn't randomly drop things, because it's very error-prone.
@sffc - +1
@nciric - Fuchsia and Dart have their own locale classes. Fuchsia makes heavy use of locale extensions. The one in Dart in particular is very small and fast. Also, even if we do lazy parsing of extensions, we still need to carry the code, right?
@mihnita - In Android, we are looking at storing preferences in the locale identifier (-u-hc, etc). And since locale parsing can be slow, caching things can help.
@zbraniecki - I see two extremes. Either we pick one locale that does what everyone needs, or we could introduce a special crate with 50+ versions to cover different use cases. But one issue is that you might have code written with one locale type, and then the conditions change and you need to change all your code to use a different locale type.
@dminor - We already have a separation between Locale and LocaleCanonicalizer. We have some space where we can shift around some of the validation and canonicalization.
@zbraniecki - There's a tremendous difference between resolving aliases and making a semantically valid locale. I want to avoid us storing semantically invalid locales.
@sffc - I think we should lean more heavily on BCP-47 strings for storage. We can have small no-alloc functions to validate the BCP-47 or to pull a specific subtag out of it.
@nciric - Are we worried about the code size, or the memory size? And in JS we have functions that could accept either a Locale or a String as input.
@zbraniecki - Validation versus portability: if we use lenient string storage... maybe that should be a trait on a string. Because there are some cases where you just care about getting the region, and that could be a trait on the string. You could end up with a class that implements those traits on top of a string.
@Manishearth - Locales are zero-alloc and non-borrowing in 99% of cases, unless there is an extension. With the string-based model, we need to either always alloc or always borrow.
@sffc - I like the direction @zbraniecki suggested.
@mihnita - I like that direction too. Two more points. (1) I might think I don't need something, but maybe some other code I used needs it, which is why you want to generally carry everything with you. (2) In Java, you can create an invalid locale, and carry semantically invalid locales with you, and it's fine. I can call functions to normalize it, but it's fine to move it around. You can move the price around to different areas.

sffc · 2022-01-27T18:22:33Z

Similar to #1034; possible duplicate

sffc · 2022-05-25T23:29:07Z

This is resolved because we now have a well-documented solution for how to do zero-copy storage of locales.

https://unicode-org.github.io/icu4x-docs/doc/icu_locid/zerovec/index.html

sffc added C-locale Component: Locale identifiers, BCP47 discuss Discuss at a future ICU4X-SC meeting labels Aug 17, 2021

sffc mentioned this issue Aug 18, 2021

Determine low-hanging fruit in FixedDecimalFormat FFI example #951

Closed

sffc added S-large Size: A few weeks (larger feature, major refactoring) T-core Type: Required functionality and removed discuss Discuss at a future ICU4X-SC meeting labels Aug 26, 2021

sffc changed the title ~~Why not make a zero-copy Locale?~~ Design architecture around low-cost locale parsing and storage Aug 26, 2021

sffc added this to the ICU4X 0.5 milestone Aug 26, 2021

sffc added the help wanted Issue needs an assignee label Aug 26, 2021

sffc mentioned this issue Sep 5, 2021

Migrate LocaleCanonicalizer data structs to zero-copy #1034

Closed

sffc modified the milestones: ICU4X 0.5, ICU4X 0.6 Jan 27, 2022

sffc removed the help wanted Issue needs an assignee label Jan 27, 2022

sffc self-assigned this Jan 27, 2022

sffc closed this as completed May 25, 2022

sffc added the R-obsolete Resolution: This issue is no longer relevant label May 25, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Design architecture around low-cost locale parsing and storage #958

Design architecture around low-cost locale parsing and storage #958

sffc commented Aug 17, 2021 •

edited

Loading

sffc commented Aug 26, 2021

sffc commented Jan 27, 2022

sffc commented May 25, 2022

Design architecture around low-cost locale parsing and storage #958

Design architecture around low-cost locale parsing and storage #958

Comments

sffc commented Aug 17, 2021 • edited Loading

sffc commented Aug 26, 2021

sffc commented Jan 27, 2022

sffc commented May 25, 2022

sffc commented Aug 17, 2021 •

edited

Loading