<format>: Add grapheme clusterization support for width computation #2119

barcharcraz · 2021-08-13T02:05:22Z

Adds three(ish) new iterators: one for codepoint-by-codepoint Unicode decoding (UTF-8 or UTF-16), one for grapheme clusterization, and one for counting the width of a string. (The third is needed because for legacy encodings we only want to count the width; we have no need to actually decode them to UTF-32.)
Uses these in _Measure_string_prefix to count only the first character of each Unicode extended grapheme cluster (untailored), as defined in UAX #29 and using the Unicode 13 data files (included in tools/unicode_properties_parse). This new processing is only turned on for the "statically UTF-8" case, and not when the system codepage is set to UTF-8. This could be improved in the future, although the current choice arguably leads to more consistent behavior.
_Fmt_codec is adapted to use the new Unicode parsing functions, which makes it more robust against malformed (in particular truncated) UTF-8. _Parse_align still gets only the number of code units in the alignment character, so it will copy an entire malformed subsequence to the format specs rather than a U+FFFD replacement (this is not tested).
Adds tests for Unicode decoding and clusterization. Clusterization test data is again generated from the Unicode data files. Note the no-op _Decode_utf function for char32_t inputs: it exists to facilitate testing, but could be removed without too much trouble (after modifying the test generator to encode its output).
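
To illustrate the malformed-input handling described above, here is a minimal sketch of a UTF-8 decoder that emits one U+FFFD per maximal ill-formed subsequence. It is not the PR's _Decode_utf: the names decode_one and decode_all are hypothetical, and the sketch only covers UTF-8 held in a std::string_view.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

constexpr char32_t replacement = 0xFFFD;

// Hypothetical helper (not the PR's code): decodes one code point starting at s[i]
// and advances i past the bytes it consumed. Each maximal ill-formed subsequence
// produces a single U+FFFD.
char32_t decode_one(std::string_view s, std::size_t& i) {
    const unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) {
        return lead; // ASCII
    }

    int trailing      = 0;    // continuation bytes still expected
    char32_t cp       = 0;
    unsigned char lo  = 0x80; // allowed range for the next continuation byte
    unsigned char hi  = 0xBF;
    if (lead >= 0xC2 && lead <= 0xDF) {
        trailing = 1;
        cp       = lead & 0x1Fu;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        trailing = 2;
        cp       = lead & 0x0Fu;
        if (lead == 0xE0) lo = 0xA0; // reject overlong encodings
        if (lead == 0xED) hi = 0x9F; // reject UTF-16 surrogates
    } else if (lead >= 0xF0 && lead <= 0xF4) {
        trailing = 3;
        cp       = lead & 0x07u;
        if (lead == 0xF0) lo = 0x90; // reject overlong encodings
        if (lead == 0xF4) hi = 0x8F; // reject values above U+10FFFF
    } else {
        return replacement; // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }

    for (int n = 0; n < trailing; ++n) {
        if (i >= s.size()) {
            return replacement; // truncated input
        }
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < lo || c > hi) {
            return replacement; // leave c unconsumed; it may start a new sequence
        }
        cp = (cp << 6) | (c & 0x3Fu);
        ++i;
        lo = 0x80; // only the first continuation byte has a restricted range
        hi = 0xBF;
    }
    return cp;
}

// Decodes the whole input to UTF-32, replacing ill-formed data with U+FFFD.
std::u32string decode_all(std::string_view s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        out.push_back(decode_one(s, i));
    }
    return out;
}
```

In this sketch a truncated three-byte sequence such as E2 82 becomes a single U+FFFD, and the byte that ends an ill-formed sequence is re-examined as a possible lead byte, so valid text after the damage still decodes normally.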

Future improvements:

We could support clusterization for GB18030, especially if that is set as the execution charset. Doing this would require actually doing Unicode decoding for GB18030 rather than decoding it as GBK, which is what we do now. It's also not really clear what we should do on invalid data since, unlike with UTF-8, it is sometimes impossible to recover from invalid GB18030.
The decoding function could surely be faster, but it has to handle malformed data correctly and should be optimized for decoding fairly small quantities of text at once (we absolutely could decode in chunks, though).
It's possible we should be using the codecvt facets here to do the conversion, but I'm not sure whether those do the maximal invalid subsequence replacement, which I'd like to do here.
It's possible to implement the first set of break rules as a transition-table-style state machine (maybe the others as well, but the table gets much bigger). This is probably a decent idea.
If we don't do the above, then marking branches as hot or cold in the decode function and the break function may be worthwhile, but this would require VTune investigation.
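
The transition-table idea mentioned above could look roughly like the following sketch. It is an assumption-laden illustration, not code from this PR: it covers only a subset of the UAX #29 rules (GB3, GB4, GB5, GB9, GB12/13, and the default GB999), it takes already-classified input instead of looking up Grapheme_Cluster_Break properties, and it counts clusters rather than measuring width. All names (Cls, State, Step, table, count_clusters) are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Simplified break classes; the real property has more values.
enum Cls : unsigned char { Other, CR, LF, Control, Extend, RI, ClsCount };

// What the machine remembers about the text to the left of the current position.
enum State : unsigned char {
    Any,      // nothing special on the left
    AfterCR,  // previous code point was CR
    AfterCtl, // previous code point was LF or Control (also used for start of text)
    OddRI,    // an odd number of Regional_Indicators immediately precedes
    StateCount
};

struct Step {
    State next;        // state after consuming the code point
    bool break_before; // is there a cluster boundary before it?
};

// table[state][class]; columns are Other, CR, LF, Control, Extend, RI.
constexpr Step table[StateCount][ClsCount] = {
    /* Any      */ {{Any, true}, {AfterCR, true}, {AfterCtl, true}, {AfterCtl, true}, {Any, false}, {OddRI, true}},
    /* AfterCR  */ {{Any, true}, {AfterCR, true}, {AfterCtl, false}, {AfterCtl, true}, {Any, true}, {OddRI, true}},
    /* AfterCtl */ {{Any, true}, {AfterCR, true}, {AfterCtl, true}, {AfterCtl, true}, {Any, true}, {OddRI, true}},
    /* OddRI    */ {{Any, true}, {AfterCR, true}, {AfterCtl, true}, {AfterCtl, true}, {Any, false}, {Any, false}},
};

// Counts grapheme clusters under the simplified rules above.
std::size_t count_clusters(const std::vector<Cls>& text) {
    // Starting in AfterCtl forces a break before the first code point (GB1),
    // so the number of breaks seen equals the number of clusters.
    State state          = AfterCtl;
    std::size_t clusters = 0;
    for (const Cls c : text) {
        const Step step = table[state][c];
        clusters += step.break_before ? 1 : 0;
        state = step.next;
    }
    return clusters;
}
```

The Regional_Indicator handling is what makes this a state machine and not a plain pair table: whether a boundary falls between two RI code points depends on how many RIs came before. Rules such as GB11 (emoji ZWJ sequences) would need additional states, which is where the table starts to grow.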

closes: #1945

StephanTLavavej · 2022-04-23T03:12:10Z

⚠️ Note to self:

We'll need MSVC-internal build/setup changes when mirroring this PR.
VS TPN.

strega-nil-ms

Looks reasonable, just some small nits (and a terrified comment)

StephanTLavavej · 2022-04-26T21:26:03Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

Casey approved internal mirror MSVC-PR-395673, feedback addressed

StephanTLavavej · 2022-04-27T01:49:33Z

😻 🎉 🚀

barcharcraz requested a review from a team as a code owner August 13, 2021 02:05

StephanTLavavej added the enhancement (Something can be improved) and format (C++20/23 format) labels Aug 13, 2021

statementreply reviewed Aug 20, 2021

CaseyCarter reviewed Aug 20, 2021

AdamBucior reviewed Aug 20, 2021

StephanTLavavej requested changes Aug 21, 2021

barcharcraz added 20 commits September 13, 2021 17:51

Add generator for unicode static data and add static data to <format>

23c009a

work on utf-8 conversion function that can deal with invalid unicode

ac723c9

work on utf8 error handling decoder.

5fc54a9

utf-8 decoder works

7dc4b9a

add _Unicode_codepoint_iterator, to be wrapped by the grapheme cluster iterator.

8a3f277

unicode iterator actually works

697ca14

in progress break iterator op++

87c02c8

add grapemem test data gen and code generator

6108f9c

minor comment revisions.

c883233

clusterization tests pass

79c6e36

start work on porting c++ data generator to python.

a87dd27

add python data generator

98033c5

remove the c++ data generator.

ed69f8a

small comment correction

a1b3a2d

add grapheme clusterization

348e217

add license header

3ee49f4

constexpr gb11 regex.

416866f

line length in data generator.

8726535

teach validate to ignore unicode data files.

a3a6213

use lower bound instead of upper bound, as lower bound is in xutil.

5ee59cb

barcharcraz added 2 commits March 31, 2022 12:08

remove bad unicode characters from comments and break some long lines.

572b809

disable clang-format for the _entire_ .gitignore file.

5a3c30c

StephanTLavavej unassigned barcharcraz Apr 1, 2022

cpplearner reviewed Apr 2, 2022

StephanTLavavej assigned barcharcraz and StephanTLavavej and unassigned barcharcraz Apr 6, 2022

StephanTLavavej added 2 commits April 21, 2022 20:22

Merge branch 'main' into format_uax29

11b726f

Teach parallelize.cpp to skip dotfiles.

0516c26

StephanTLavavej reviewed Apr 23, 2022

Code review feedback.

1c4217e

StephanTLavavej approved these changes Apr 23, 2022

StephanTLavavej removed their assignment Apr 23, 2022

strega-nil-ms approved these changes Apr 23, 2022

barcharcraz commented Apr 23, 2022

Code review feedback from Nicole and Charlie.

48a2e10

StephanTLavavej approved these changes Apr 23, 2022

StephanTLavavej self-assigned this Apr 26, 2022

StephanTLavavej merged commit 9d1421d into microsoft:main Apr 27, 2022
