Should PartialEq<str> for Locale be strict or lenient? #1709

sffc · 2022-03-17T20:44:08Z

Background: #1704

zbraniecki · 2022-03-18T16:42:21Z

I believe for 1.0 we should have a mental model of how our API communicate different strictness levels of comparison and what is the role of traits such as Eq and PartialEq in this.

For example, we could say that Eq expects case normalization and -, while PartialEq would return true for en-US == eN_us.

We should also consider incomplete matching - en-US vs en-Latn-US or en-US-u-ca-buddhist vs en-US.
In language matching API for l10n needs I added matches method which allows to treat missing fields on either side as "catch-all" or "*".
You can read through the steps here to see how it's used: https://github.com/projectfluent/fluent-langneg-rs/blob/master/src/negotiate/mod.rs#L54

The case normalization vs unnormalized API design is mostly against str, but treatment of missing fields is for types as well. (should en-US-x-foo match en-US-x-bar?)

zbraniecki · 2022-03-18T16:43:17Z

To recap - I don't think we need to implement all of them, but I think before 1.0 we should have a good directions on which API compares how so that as the needs arise we can implement the particular methods/traits consistently in shape that we defined and set prior to 1.0.

sffc · 2022-03-19T02:46:52Z

A few considerations:

My reading of the docs is that PartialEq and PartialOrd must be consistent, and if Eq/Ord is implemented, their behavior must be the same as the PartialEq/PartialOrd behavior.
PartialEq and PartialOrd can be implemented across types, but Eq and Ord are only defined on Rhs = Self.

I tend to have a fairly strong preference that the Self comparisons be strict in all cases where possible, because then we can implement Eq and Ord which enables the use of various Rust data structures.

Across types, my preference is not as strong, but I tend to prefer strict comparison where possible. I think we should offer lenient comparison, but I don't think it's what users should get by default if they don't know better. In most API design cases I've come across, "strict by default" usually ends up being the chosen option.

zbraniecki · 2022-03-19T07:29:19Z

When you say strict I assume you mean case strict. Which is reasonable for me since Locale::from("en-US") == Locale::from("eN_us") will result in the same canonicalized locale.

Across types, I think in particular for comparing locale against a string, I think we should be lenient (diverge from your position) and Locale::from("en-US") == "eN_uS" should return true because I think most users don't care about canonicalization and they believe that differently cased strings still mean "American English".

On the other axis (how to indicate that und-US and en-US shoudl "match"), I lean toward special methods like match/matches.

sffc · 2022-03-19T15:27:43Z

When you say strict I assume you mean case strict.

What I mean is a comparison that satisfies the properties of a "total order", which implies case sensitive, yes.

How would you feel about just not implementing PartialEq across types and then having two methods users can choose from to make the choice between strict and lenient more explicit?

Here are four reasons I tend to prefer strict comparison by default:

Can be used for binary search and other situations that require it
More efficient and smaller code because we don't need special logic to massage the string into the correct form
More well-defined; leniency can be on many levels, but there is only one way for strict to work. For example, leniency could make en-Latn-US equal any subset of the following depending on how lenient we make it: EN-LATN-US, en_Latn_US, en-US-Latn. On this topic, if we decided to change the leniency in the future, it may be a breaking change in behavior of PartialEq. It would be easier to be able to simply add another version of the lenient comparison function.
What I said earlier about strict-by-default being less surprising.

zbraniecki · 2022-03-21T22:52:08Z

How would you feel about just not implementing PartialEq across types and then having two methods users can choose from to make the choice between strict and lenient more explicit?

I feel that if there's no "canonical" expectation, that makes sense - we're effectively kicking developers out of their comfort zone and asking them to explicitly decide.

Below, I'll argue against the assumption that we don't know what is the canonical expectation tho :)

Can be used for binary search and other situations that require it

I think I disagree.

let loc_map: HashMap<Locale, SomeData>;

assert!(loc_map.binary_search("en_US").is_err());

I'd argue that if someone asks for it against a string that they expect en-US to match en_US. Simultaneously I see no scenario where a user would like to pass a string and not match it.
Why would they? To add en_US next to en-US entry? To say "we have no data for en_US"?)

If they really need to, they could check that their string is not canonical locale and reject it prior to matching.

More efficient and smaller code because we don't need special logic to massage the string into the correct form

Agree. The strict vs loose matching can be seen as performance question. I'd argue that by default we should be lenient on input, and allow for "strict_cmp" for cases that need it.

More well-defined;

I don't think your examples are plausible. en-US-Latn is an incorrect locale tag and will reject parsing. EN-LATN-US and en_Latn_US will both parse. I believe case-insensitive + -/_ is the lenient and it won't change.

What I said earlier about strict-by-default being less surprising.

I think it's a subjective argument. For one person it may be surprising to see loc!("en-US") == "en_US" for another it may be surprising to see it not match.

I think if we embrace "lenient on input, strict on output" then we should be lenient here. If we reject it, we should be strict and be very clear in docs that any string used for comparison tests against any component of a locale has to be canonicalized before for the matching to work.

We could also do any of the:

PartialEq<str> is strict, and component.matches(str) is lenient
PartialEq<strin> is lenient and component.matches_strict(str) is strict

and then separate matrix for the meaning of missing components (und-US matches en-US, en-US matches en-Latn-US etc.)

sffc · 2022-03-22T02:28:02Z

Can be used for binary search and other situations that require it

I think I disagree.

Rust binary_search is defined only for Ord types, and Ord is defined to mean types that have a "total order".

Is binary_search well-defined on non-total-order impls? If so, then why does the standard library require Ord instead of PartialOrd?

More efficient and smaller code because we don't need special logic to massage the string into the correct form

Agree. The strict vs loose matching can be seen as performance question. I'd argue that by default we should be lenient on input, and allow for "strict_cmp" for cases that need it.

Ok. My position, weakly held, is that the thing that is better for the ICU4X value proposition (performance, memory usage, or in this case code size) should be the default behavior.

One could make the argument that lenient comparison is better for the "i18n correctness" value proposition, an argument with which I disagree, since we should be expressly encouraging people to use BCP-47 language tags, rather than being lenient by default and allowing the old format to continue to exist and be supported in the wild.

More well-defined;

I don't think your examples are plausible. en-US-Latn is an incorrect locale tag and will reject parsing. EN-LATN-US and en_Latn_US will both parse. I believe case-insensitive + -/_ is the lenient and it won't change.

Does UTS-35 define exactly what sets of strings are considered valid but non-canonical?

I'd like to draw a corollary with the related issue of Date.parse() in ECMAScript. ECMA-262 has long said that Date.parse() has lenient parsing based on implementation-defined behavior. What has happened is that different browser engines have defined the lenient parsing differently, and there's now a big miss-mash of what Date.parse() can and cannot do. The Temporal proposal resolves this issue by only parsing ISO-8601 strings.

What I said earlier about strict-by-default being less surprising.

I think it's a subjective argument. For one person it may be surprising to see loc!("en-US") == "en_US" for another it may be surprising to see it not match.

My position here is that if users require lenient behavior, they will discover fairly quickly that the default impl is not lenient through testing. However, if the users require strict behavior, they may never find out that they are using the less-efficient lenient method, since the lenient method works in all the cases that the strict method works. It is therefore less likely to cause surprise if strict is default.

zbraniecki · 2022-03-22T02:33:32Z

ECMA-262 has long said that Date.parse() has lenient parsing based on implementation-defined behavior.

I see the problem in the fact that they let it be "implementation-defined". We would be more strict allowing only for case-insensitive and -/_.

However, if the users require strict behavior,

I struggle to imagine a case where a user would like the behavior of matching a Locale to a stringified version of it to be strict.

sffc · 2022-03-22T02:41:27Z

I struggle to imagine a case where a user would like the behavior of matching a Locale to a stringified version of it to be strict.

If the user has a list of strings that they know are definitely BCP-47 canonical, and they don't want to carry the extra code size / performance cost of "canonicalization on the fly". I have such a case in the data provider.

sffc · 2022-03-22T02:45:23Z

ECMA-262 has long said that Date.parse() has lenient parsing based on implementation-defined behavior.

I see the problem in the fact that they let it be "implementation-defined". We would be more strict allowing only for case-insensitive and -/_.

If we're clear that these are the only exceptions, then this satisfies argument 3. Argument 2 is weakened, except for my side-note about the negative impact on the ecosystem of allowing the old stuff to propogate. Arguments 1 and 4 still stand.

zbraniecki · 2022-03-22T07:15:57Z

If the user has a list of strings that they know are definitely BCP-47 canonical, and they don't want to carry the extra code size / performance cost of "canonicalization on the fly". I have such a case in the data provider.

I wasn't explicit enough. I struggle to find a use case where a consumer of our library who is not very familiar with the nuances of the canonicalization and wants to "just use" Locale class would use the convenience PartialEq trait and be disappointed that it matches en-US to en_US.

I imagine a special-case consumer who is very performance/code-size conscious and they would want to use matches_strict or something like that.

Arguments 1 and 4 still stand.

My responses still stand as well.

Anyhoo. I'm 0.2 on apache scale and would prefer not to invest much more time in bikeshedding since we seem to have our arguments laid out and I doubt we are in a position to convince one another.

Maybe we can discuss it in icu4x bi-weekly, vote and move on.

dminor · 2022-05-26T13:25:26Z

Marking this as discuss-priority so we can can come to a decision on this at the next meeting.

sffc · 2022-05-26T18:09:37Z

Discussion:

@Manishearth - In general, PartialEq should generally not be doing magic stuff, so if it exists, it should generally be strict, but I'd prefer not having it for str.
@robertbastian - Having PartialEq<str> is confusing, anyway, because it's not clear what logic it uses
@nordzilla - Does it affect functions that assume you have an impl? I guess most of those let you pass a closure instead.

Proposal: Do not have PartialEq<str> and only have methods.

LGTM: @echeran @sffc @Manishearth @robertbastian @dminor

sffc · 2022-05-26T18:21:46Z

Naming for the two functions:

cmp_bytes is currently a strict comparison against &[u8].

New names:

normalizing_eq(&str) -> bool
strict_cmp(&[u8]) -> Ordering

LGTM: @QnnOkabayashi @Manishearth @robertbastian @nordzilla @zbraniecki

In the future we could add:

normalizing_cmp(&str) -> Ordering
strict_eq(&[u8]) -> bool

robertbastian · 2022-06-08T00:27:54Z

Should we also remove PartialEq<str> from the subtags? LocaleCanonicalizer does a lot of subtag/str comparisons and I think it'd be better to make it explicit whether they're strict or normalizing.

sffc · 2022-06-08T00:59:47Z

I believe so, yes. I expect that the use cases in LocaleCanonicalizer might benefit anyway from using more [u8] rather than str for additional deserialization performance gains.

We should use strict comparison for data coming from the DataProvider, because we expect the data to have been pre-normalized in datagen.

robertbastian · 2022-06-08T02:25:12Z

LocaleCanonicalizer's data is mainly unvalidated TinyStrs, and its input are valid Locales, but somehow the logic ends up comparing &strs internally.

sffc · 2022-06-08T02:48:34Z

Oh right, it uses TinyStr because the subtags might not all be the same length. You can get the &[u8] from the TinyStr.

sffc added C-locale Component: Locale identifiers, BCP47 discuss Discuss at a future ICU4X-SC meeting v1 labels Mar 17, 2022

sffc added this to the ICU4X 1.0 milestone Apr 1, 2022

sffc removed the v1 label Apr 1, 2022

sffc mentioned this issue Apr 2, 2022

When icu::locid::extensions::unicode::Value has a single-segment value, make it borrowable as &str #1764

Closed

dminor added discuss-priority Discuss at the next ICU4X meeting and removed discuss Discuss at a future ICU4X-SC meeting labels May 26, 2022

sffc added T-core Type: Required functionality S-small Size: One afternoon (small bug fix or enhancement) and removed discuss-priority Discuss at the next ICU4X meeting labels May 26, 2022

sffc assigned robertbastian May 26, 2022

sffc modified the milestones: ICU4X 1.0 Untriaged, ICU4X 1.0 (Features) May 26, 2022

robertbastian mentioned this issue Jun 7, 2022

Update Locale and LanguageIdentifier comparison names #2020

Merged

sffc closed this as completed in #2020 Jun 8, 2022

robertbastian mentioned this issue Jun 13, 2022

normalizing_eq, strict_cmp for LSRV subtags #2048

Merged

sffc mentioned this issue Jun 1, 2023

How to implement ordering on AnyCalendarInner? #3469

Open

sffc mentioned this issue Jan 31, 2024

Consider support of old locale extension syntax in locale parsing #4566

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Should PartialEq<str> for Locale be strict or lenient? #1709

Should PartialEq<str> for Locale be strict or lenient? #1709

sffc commented Mar 17, 2022

zbraniecki commented Mar 18, 2022

zbraniecki commented Mar 18, 2022

sffc commented Mar 19, 2022

zbraniecki commented Mar 19, 2022

sffc commented Mar 19, 2022

zbraniecki commented Mar 21, 2022

sffc commented Mar 22, 2022 •

edited

Loading

zbraniecki commented Mar 22, 2022

sffc commented Mar 22, 2022 •

edited

Loading

sffc commented Mar 22, 2022 •

edited

Loading

zbraniecki commented Mar 22, 2022

dminor commented May 26, 2022

sffc commented May 26, 2022

sffc commented May 26, 2022 •

edited

Loading

robertbastian commented Jun 8, 2022

sffc commented Jun 8, 2022

robertbastian commented Jun 8, 2022

sffc commented Jun 8, 2022

Should PartialEq<str> for Locale be strict or lenient? #1709

Should PartialEq<str> for Locale be strict or lenient? #1709

Comments

sffc commented Mar 17, 2022

zbraniecki commented Mar 18, 2022

zbraniecki commented Mar 18, 2022

sffc commented Mar 19, 2022

zbraniecki commented Mar 19, 2022

sffc commented Mar 19, 2022

zbraniecki commented Mar 21, 2022

sffc commented Mar 22, 2022 • edited Loading

zbraniecki commented Mar 22, 2022

sffc commented Mar 22, 2022 • edited Loading

sffc commented Mar 22, 2022 • edited Loading

zbraniecki commented Mar 22, 2022

dminor commented May 26, 2022

sffc commented May 26, 2022

sffc commented May 26, 2022 • edited Loading

robertbastian commented Jun 8, 2022

sffc commented Jun 8, 2022

robertbastian commented Jun 8, 2022

sffc commented Jun 8, 2022

sffc commented Mar 22, 2022 •

edited

Loading

sffc commented Mar 22, 2022 •

edited

Loading

sffc commented Mar 22, 2022 •

edited

Loading

sffc commented May 26, 2022 •

edited

Loading