Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<format>: Add grapheme clusterization support for width computation #2119

Merged
merged 71 commits into from
Apr 27, 2022

Conversation

barcharcraz
Copy link
Member

@barcharcraz barcharcraz commented Aug 13, 2021

  • Adds three(ish) new iterators, one for codepoint-by-codepoint unicode decoding (utf-8 or 16), one for grapheme clusterization, and one for counting the width of a string. (The reason the third is needed is that for legacy encodings we'd like to only count the width, we have no need to actually decode them to UTF-32.)
  • Uses these in _Measure_string_prefix to count only the first character of each unicode extended grapheme cluster (untailored) as defined in UAX29 and using the Unicode 13 data files (included in tools/unicode_properties_parse). This new processing is only turned on for the "statically UTF-8" case, and not when the system codepage is set to UTF-8, this could be improved in the future, although arguably leads to more consistent behavior.
  • _Fmt_codec is adapted to use the new unicode parsing functions, this makes it more robust against malformed (in particular truncated) UTF-8. _Parse_align still gets only the number of code units in the alignment character, so it will copy an entire malformed subsequence to the format specs, not a U+FFFD replacement (this is not tested).
  • tests for unicode decoding and clusterization. Clusterization test data is again generated from the unicode data files. Note the no-op _Decode_utf function for char32_t inputs, this is to facilitate testing, but could be removed without too much trouble (after modifying the test generator to encode its output).

Future improvements:

  • we could support clusterization for gb18030, especially if set that as the execution charset, doing this would require actually doing unicode decoding for 18030, not decoding it as GBK, which we do now. Also it's not really clear what we should do on invalid data, since sometimes it's impossible to recover from invalid 18030 like it is with UTF-8.
  • The decoding function could surely be faster, but has to be able to handle malformed data correctly, and should be optimized for decoding fairly small quantities of text at once (we absolutely could decode in chunks though)
  • it's possible we should be using the codecvt facets here to do the conversion, but I'm not sure if those do the maximal invalid subsequence replacement thing, which I'd like to do here.
  • it's possible to implement the first set of break rules as a transition table style state-machine, (maybe the others as well, but the table gets much bigger). This is probably a decent idea.
  • If we don't do the above then marking branches as hot or cold in the decode function and the break function may be worthwhile, but this would require vtune investigation.

closes: #1945

@barcharcraz barcharcraz requested a review from a team as a code owner August 13, 2021 02:05
@StephanTLavavej StephanTLavavej added enhancement Something can be improved format C++20/23 format labels Aug 13, 2021
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
tools/validate/validate.cpp Outdated Show resolved Hide resolved
tools/validate/validate.cpp Outdated Show resolved Hide resolved
tools/unicode_properties_parse/GraphemeBreakProperty.txt Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
@barcharcraz

This comment was marked as resolved.

@StephanTLavavej

This comment was marked as resolved.

stl/CMakeLists.txt Show resolved Hide resolved
stl/CMakeLists.txt Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
@StephanTLavavej
Copy link
Member

⚠️ Note to self:

  1. We'll need MSVC-internal build/setup changes when mirroring this PR.
  2. VS TPN.

@StephanTLavavej StephanTLavavej removed their assignment Apr 23, 2022
Copy link
Contributor

@strega-nil-ms strega-nil-ms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable, just some small nits (and a terrified comment)

stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
stl/inc/format Outdated Show resolved Hide resolved
@StephanTLavavej StephanTLavavej self-assigned this Apr 26, 2022
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej dismissed CaseyCarter’s stale review April 27, 2022 01:46

Casey approved internal mirror MSVC-PR-395673, feedback addressed

@StephanTLavavej StephanTLavavej merged commit 9d1421d into microsoft:main Apr 27, 2022
@StephanTLavavej
Copy link
Member

😻 🎉 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Something can be improved format C++20/23 format
Projects
None yet
Development

Successfully merging this pull request may close these issues.

<format>: Grapheme clusterization
8 participants