Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate LocaleCanonicalizer data structs to zero-copy #1034

Closed
sffc opened this issue Sep 1, 2021 · 30 comments · Fixed by #1777
Closed

Migrate LocaleCanonicalizer data structs to zero-copy #1034

sffc opened this issue Sep 1, 2021 · 30 comments · Fixed by #1777
Assignees
Labels
A-data Area: Data coverage or quality C-locale Component: Locale identifiers, BCP47 S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality

Comments

@sffc
Copy link
Member

sffc commented Sep 1, 2021

Main issue: #856

The LocaleCanonicalizer data structs currently rely on Vec. They should be migrated to ZeroVec, VarZeroVec, or ZeroMap.

https://github.com/unicode-org/icu4x/blob/main/components/locale_canonicalizer/src/provider.rs

I think in most cases ZeroMap serves the use case fairly well. (This would make this issue depend on #844)

For example, Vec<(TinyStr4, LanguageIdentifier)> could become ZeroMap<TinyStr4, LanguageIdentifier> (where LanguageIdentifier needs to support AsVarULE).

Is there a reason you aren't already using LiteMap in your data structure?

CC @dminor @zbraniecki @Manishearth

@sffc sffc added T-core Type: Required functionality C-locale Component: Locale identifiers, BCP47 S-small Size: One afternoon (small bug fix or enhancement) A-data Area: Data coverage or quality labels Sep 1, 2021
@dminor
Copy link
Contributor

dminor commented Sep 1, 2021

I think LiteMap wasn't yet written at the time the likely subtags support was implemented, and I kept Vec for consistency while doing the alias part. I don't see any reason not to use ZeroMap once we're able to.

@sffc
Copy link
Member Author

sffc commented Sep 1, 2021

I have some questions about how you use LanguageIdentifier in LocaleCanonicalizer data.

LanguageIdentifier is often used as the "value" in the key/value maps in both the AliasesV1 and the LikelySubtagsV1 data structs. When I look in the data file, I see things like:

    [
      "arc",
      "Palm",
      "und-SY"
    ],
...
    [
      "zza",
      "und-Latn-TR"
    ]
...

    [
      "und-hepburn-heploc",
      "und-alalc97"
    ],

My questions are:

  1. Would it be consistent with the required data model to represent some of these sets as, e.g., a (TinyStr4, TinyStr4) (for script+region) instead of a LanguageIdentifier?
  2. For cases where that is not possible, would storing a string instead of a LanguageIdentifier be consistent with the required data model?

The reason I'm asking is that we would need to design and implement AsVarULE for LanguageIdentifer, which may be non-trivial and may come at a potential performance cost. If we could represent the data as Copy + 'static types instead, we have more options for optimization when we implement ZeroMap.

@dminor
Copy link
Contributor

dminor commented Sep 2, 2021

This may have improved now, but the last time I tried breaking LanguageIdentifiers up into tuples of TinyStr4, it caused a substantial increase in the disk storage size for the data, both json and bincode.

In general, the algorithms require being able to examine the parts of the LanguageIdentifier. I think a tuple representation would be fine for this, but I'm concerned that a string representation would be quite a bit slower.

This is performance critical code, I don't think we should accept a change that regresses performance. Zibi invested a lot of time in making the likely subtags code fast and the canonicalization code is already substantially slower than the equivalent code in SpiderMonkey, so we don't want to make that even worse. There are benchmarks, it might be possible to quickly test things out and see if it going to have a big impact on performance or not.

@sffc sffc added S-medium Size: Less than a week (larger bug fix or enhancement) and removed S-small Size: One afternoon (small bug fix or enhancement) labels Sep 5, 2021
@sffc
Copy link
Member Author

sffc commented Sep 5, 2021

Thanks for the info.

To be clear, the status quo is that all the LanguageIdentifier strings in the data bundle need to be parsed at data loading time. What we need to figure out is how to store the LanguageIdentifiers in ZeroVec. There are several approaches I can think of:

  1. Use tuples of TinyStr instead of LanguageIdentifier
    • Pro: Easy
    • Con: Impact on data size?
  2. Implement AsVarULE for LanguageIdentifier (keeps the in-file representation as a string, but performs lazy parsing when data is accessed)
    • Pro: Minimal impact on data size
    • Con: Likely to cause measurable performance regressions in the canonicalize function
  3. Design a new Locale-like data structure optimized for storage and retrieval in data files (Design architecture around low-cost locale parsing and storage #958)

Either way, whoever takes this bug will need to do some experimentation and come back with a recommended solution that enables zero-copy data while keeping good performance.

@dminor
Copy link
Contributor

dminor commented Sep 7, 2021

I was afraid you were proposing a design where we'd have to parse the LanguageIdentifier from a string every time we do a canonicalization or likely subtags operation. Parsing once at data loading time is not going to be a problem.

@Manishearth
Copy link
Member

Note that solution 3 involves a little bit of parsing each time, but it should not be expensive

@zbraniecki
Copy link
Member

I'm intuitively in favor of [1].

@sffc sffc added good first issue Good for newcomers help wanted Issue needs an assignee labels Sep 30, 2021
@sffc sffc added this to the ICU4X 0.4 milestone Sep 30, 2021
@sffc
Copy link
Member Author

sffc commented Oct 19, 2021

@pandusonu2 will be working on this.

@sffc sffc removed the help wanted Issue needs an assignee label Oct 19, 2021
@pandusonu2
Copy link
Contributor

Steps discussed:

  • Vec<(K, V)> to LiteMap
  • LanguageIdentifier to custom tuple struct
  • LiteMap to ZeroMap

@pandusonu2
Copy link
Contributor

pandusonu2 commented Dec 2, 2021

From what I understand, the next step is to convert

pub struct LanguageIdentifier {
    pub language: subtags::Language,
    pub script: Option<subtags::Script>,
    pub region: Option<subtags::Region>,
    pub variants: subtags::Variants,
}

to just use everywhere:

(subtags:Languauge, Option<Script>, Option<Region> subtags::Variants)

@Manishearth
Copy link
Member

@sffc okay, that makes sense.

I think (once this is discussed and approved), the steps for @pandusonu2 would be to first restructure LanguageIdentifier so that variants is a single String, and then write the ULE type (perhaps with help from me on getting the impl right, but it's a good chance to see how good our docs are!)

I'm fine with @pandusonu2 starting work on a prototype as well if you think that's a good move.

@sffc
Copy link
Member Author

sffc commented Dec 14, 2021

I don't think we should change LanguageIdentifier as part of this. We should re-parse the variants string to the Vec when converting. It's an edge case when you have more than 1 variant anyway.

Thinking of it, it might be fine to make the ULE be simply the BCP-47 string, and we just do a little simple parsing when accessing a field. We have to do that anyway when using the offset marker. The advantage is that we can have a well-defined as_str().

@sffc
Copy link
Member Author

sffc commented Dec 14, 2021

pub struct LanguageIdentifierStr([u8]);

impl LanguageIdentifierStr {
  pub as_str(&self) -> &str {
    // Safe because of invariants on LanguageIdentifierStr
    str::from_utf8_unchecked(self.0)
  }
}

@Manishearth
Copy link
Member

I think we can just wrap it around str even, and have the wrapper just guarantee that it's been validated to be in the right format for a variants string

@sffc
Copy link
Member Author

sffc commented Dec 14, 2021

Whether we use str or [u8], the ULE impl should check the string for syntax validity.

@sffc
Copy link
Member Author

sffc commented Dec 23, 2021

Discussion:

  • @zbraniecki / @dminor: In LocaleCanonicalizer, we don't carry variants for any of the LanguageIdentifiers.
  • @zbraniecki - The left side of the map is the one that's performance-critical. The right side is less critical.
  • @zbraniecki - I think we should remove all LanguageIdentifier from all locale canonicalizer and replace them with tuples of TinyStr.
  • @sffc - I think that's reasonable and in some cases it could reduce data size.

Conclusion:

  • @pandusonu2 will migrate LikelySubtagsV1 to have only TinyStr tuples
  • @zbraniecki will continue work on AliasesV1

@sffc sffc removed the discuss-priority Discuss at the next ICU4X meeting label Dec 23, 2021
@pandusonu2
Copy link
Contributor

pandusonu2 commented Dec 27, 2021

@sffc, @zbraniecki doubt:

Should und be replaced with tuple of TinyStr as well?
Also, if so, how to implement #[derive(Default)] for tuples here?

@zbraniecki
Copy link
Member

Should und be replaced with tuple of TinyStr as well?

I'd like to maintain the Option<TinyStr4> for it. Is there a reason not to?

@pandusonu2
Copy link
Contributor

Should und be replaced with tuple of TinyStr as well?

I'd like to maintain the Option<TinyStr4> for it. Is there a reason not to?

I was just not sure how #[derive(Default)] works, and that if TinyStr4 would suffice, will update with Option<TinyStr4> for now, and can discuss on the PR if thats good enough I suppose

@sffc
Copy link
Member Author

sffc commented Jan 27, 2022

Blocked on TinyStr migration.

@robertbastian robertbastian removed the blocked A dependency must be resolved before this is actionable label Feb 23, 2022
@pandusonu2
Copy link
Contributor

Maybe blocked on #831

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-data Area: Data coverage or quality C-locale Component: Locale identifiers, BCP47 S-medium Size: Less than a week (larger bug fix or enhancement) T-core Type: Required functionality
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants