Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide a 3-byte type with GIGO convertibility to/from char in zerovec #1968

Closed
hsivonen opened this issue May 30, 2022 · 16 comments · Fixed by #3444
Closed

Provide a 3-byte type with GIGO convertibility to/from char in zerovec #1968

hsivonen opened this issue May 30, 2022 · 16 comments · Fixed by #3444
Assignees
Labels
C-zerovec Component: Yoke, ZeroVec, DataBake good first issue Good for newcomers help wanted Issue needs an assignee S-small Size: One afternoon (small bug fix or enhancement) T-enhancement Type: Nice-to-have but not required
Milestone

Comments

@hsivonen
Copy link
Member

Please provide a type in zerovec that has these properties:

  • Stored as 3 bytes.
  • Converts from char.
  • Converts to char such that if the value is outside the scalar value range, panics if debug assertions are enabled and returns U+FFFD if debug assertions are not enabled.

Use case: Storing non-BMP normalization expansions without wasting a byte per scalar value.

@hsivonen
Copy link
Member Author

CC @Manishearth

@Manishearth
Copy link
Member

Manishearth commented May 31, 2022

This feels rather specific to the use case, I don't know if this belongs in zerovec but it can be designed pretty easily outside of zerovec, you just have to implement AsULE and ULE appropriately. Happy to provide guidance on that.

In general i intend zerovec to be a generally useful crate, not one specific for i18n, and as such I view APIs under that lens when considering them for inclusion.

@sffc
Copy link
Member

sffc commented May 31, 2022

This seems like a reasonable addition. I would say that this should be on the general type CharULE.

This can be done in two phases:

  1. Change CharULE to from 4 bytes to 3 bytes
  2. Change the validation behavior to use U+FFFD

We should do (1) in the short term, before 1.0, since it affects data file stability. This should be a fairly easy change to make. We can do (2) later, since it only affects runtime behavior.

@sffc sffc added good first issue Good for newcomers help wanted Issue needs an assignee T-enhancement Type: Nice-to-have but not required labels May 31, 2022
@Manishearth
Copy link
Member

So I'm very happy to have (1), I'm not as sure if we should have U+FFFD validation behavior. It seems a bit strange to me and rather specific, given that zerovec tends to avoid GIGO overall.

@sffc sffc added this to the ICU4X 1.1 milestone May 31, 2022
@sffc sffc added C-data-infra Component: provider, datagen, fallback, adapters S-small Size: One afternoon (small bug fix or enhancement) labels May 31, 2022
@sffc
Copy link
Member

sffc commented May 31, 2022

I split (1) into a separate issue in the 1.0 Polish milestone: #1970

We can keep this issue open for a future milestone to further discuss (2).

@sffc sffc added the discuss Discuss at a future ICU4X-SC meeting label May 31, 2022
@Manishearth
Copy link
Member

sounds good!

@hsivonen
Copy link
Member Author

Thanks. 3 bytes per char is more important for my use case than whether the value space is checked on access (and mapped to U+FFFD) or checked at deserialization time (and fails deserialization) assuming that the deserialization check happens at most once over the lifetime of the process and doesn't happen on every normalizer or collator instantiation.

@hsivonen
Copy link
Member Author

(I would also take a U24 that would just be a 3-byte integer.)

@Manishearth
Copy link
Member

happens at most once over the lifetime of the process and doesn't happen on every normalizer or collator instantiation

(as we've mentioned before this is entirely dependent on the caching strategy/etc used but we do plan to support the ability to do this)

@Manishearth
Copy link
Member

the u24 can be done without any unsafe by using RawBytesULE<3> and a U24 type that contains that type as its AsULE, with convenient getters

@hsivonen
Copy link
Member Author

hsivonen commented Jun 2, 2022

I implemented U24 for my immediate use case in #1967, but afterwards it could be hoisted into some place more generic.

@robertbastian
Copy link
Member

I think it'd be confusing for CharULE to not work for all chars, which is what I understand would happen with @sffc's suggestion.

Wouldn't it be cleaner to define another type like BmpCharULE? This could live outside the zerovec crate so we could do all the GIGO we want.

@hsivonen
Copy link
Member Author

hsivonen commented Jun 7, 2022

I think it'd be confusing for CharULE to not work for all chars, which is what I understand would happen with @sffc's suggestion.

3 bytes in enough for any char.

Wouldn't it be cleaner to define another type like BmpCharULE? This could live outside the zerovec crate so we could do all the GIGO we want.

It would make sense to have a BmpChar as well. In particular, it would be extremely weird if the supplementary characters in normalizer data were validated at deserialization time but the BMP characters weren't.

@robertbastian
Copy link
Member

Oh right it doesn't use UTF-8 internally. However, doesn't making CharULE a different size to char break things like storing &[T::ULE] and then interpreting that as &[T]?

@sffc
Copy link
Member

sffc commented Jun 7, 2022

Oh right it doesn't use UTF-8 internally. However, doesn't making CharULE a different size to char break things like storing &[T::ULE] and then interpreting that as &[T]?

This is not something that ZeroVec supports for any type (except implicitly by those where T::ULE == T).

ZeroVec has from_slice_or_alloc to map &[T] to &[T::ULE] based on EqULE, which will no longer be supported for char, but this is a methodd I want to phase out anyway: #1935

@sffc
Copy link
Member

sffc commented Dec 16, 2022

Discussion with @robertbastian @Manishearth @sffc:

  • char should remain a deserialization-time validated type
  • We can/should introduce UnvalidatedChar([u8; 3]), whose ULE is RawBytesULE<3>, which has functions to convert to char with various types of behavior
  • No consensus on where this type should live; maybe it should live next to UnvalidatedStr, but we're not sure that's in the best place either
  • Decision: Implement it next to UnvalidatedStr, but re-evaluate the location of both types before Utils 1.0

@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Dec 16, 2022
@sffc sffc added C-zerovec Component: Yoke, ZeroVec, DataBake and removed C-data-infra Component: provider, datagen, fallback, adapters labels Dec 22, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-zerovec Component: Yoke, ZeroVec, DataBake good first issue Good for newcomers help wanted Issue needs an assignee S-small Size: One afternoon (small bug fix or enhancement) T-enhancement Type: Nice-to-have but not required
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants