Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate removing scalar value validation from HarfBuzz canonical composition and decomposition #2840

Open
hsivonen opened this issue Nov 22, 2022 · 8 comments
Assignees
Labels
A-performance Area: Performance (CPU, Memory) C-collator Component: Collation, normalization

Comments

@hsivonen
Copy link
Member

The initial glue code for providing the canonical composition and canonical decomposition operation to HarfBuzz assumes that the incoming hb_codepoint_t values might not be valid scalar values.

We should investigate avoiding the validation overhead by:

  1. Investigating what HarfBuzz itself guarantees.
  2. Considering making the relevant ICU4X APIs have u32-argument alternatives to the char-argument methods.
@hsivonen
Copy link
Member Author

For decomposition, it should be fine to make the ICU4X code operate on u32 and to make the current char-argument version a convenience wrapper on top of that.

For composition, ICU4X internals seem sufficiently reliant on having assurances of scalar valueness that either we should validate the inputs provided by HarfBuzz or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.

@hsivonen
Copy link
Member Author

or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.

The HarfBuzz API for bypassing input validation now says the caller is responsible for passing in scalar values into HarfBuzz: harfbuzz/harfbuzz@f2297e6

We could now take the position that calling hb_buffer_add_codepoints is conceptually like calling Rust's core::str::from_utf8_unchecked.

@sffc
Copy link
Member

sffc commented Dec 22, 2022

I don't think it's worth over-engineering this to avoid a validation which is essentially free.

CC @echeran

@sffc sffc added the needs-approval One or more stakeholders need to approve proposal label Dec 22, 2022
@echeran
Copy link
Contributor

echeran commented Jan 21, 2023

...either we should validate the inputs provided by HarfBuzz or take the position that if HarfBuzz passes non-scalar values that's the kind of error in using the HarfBuzz API that UB is acceptable.

That sounds reasonable. As I see it, either way, it's not an error that is actionable by the user, so UB without crashing seems acceptable indeed. FWIW, I think this is also in line with the proposed rules of thumb in the data safety PR.

We could now take the position that calling hb_buffer_add_codepoints is conceptually like calling Rust's core::str::from_utf8_unchecked.

That makes sense. And at least we're passing Rust chars, so we're taking care of the newly clarified requirement for callers.

@sffc
Copy link
Member

sffc commented Feb 2, 2023

@hsivonen Thoughts on the latest comments from myself and @echeran?

@sffc
Copy link
Member

sffc commented Feb 16, 2023

Ping @hsivonen

@hsivonen
Copy link
Member Author

hsivonen commented Dec 1, 2023

Sorry about missing that this was blocking on me. My preference is that we trust that scalar values that we receive from HarfBuzz are indeed scalar values and, therefore, can be interpreted as Rust char without validation.

@sffc
Copy link
Member

sffc commented Dec 1, 2023

Ok, seems reasonable. In any case I'd prefer to fix this in the harfbuzz glue code in the upstream repo so that we avoid interacting with raw harfbuzz types in the icu repo. #3257

@sffc sffc self-assigned this Dec 1, 2023
@sffc sffc added this to the 1.x Priority ⟨P2⟩ milestone Dec 1, 2023
@sffc sffc removed the needs-approval One or more stakeholders need to approve proposal label Dec 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-performance Area: Performance (CPU, Memory) C-collator Component: Collation, normalization
Projects
None yet
Development

No branches or pull requests

3 participants