[stdlib] Export grapheme breaking facility #62794


Merged: 5 commits merged into swiftlang:main on Jan 5, 2023

Conversation

@lorentey (Member) commented Dec 31, 2022

Unicode._CharacterRecognizer is a newly exported opaque type that exposes the stdlib’s extended grapheme cluster breaking facility, independent of String.

This essentially makes the underlying simple state machine public, without exposing any of the (unstable) Unicode details.

The ability to perform grapheme breaking over, say, the scalars stored in multiple String values can be extremely useful while building custom text processing algorithms and data structures.

Ideally this would eventually become API, but before proposing this to Swift Evolution, I’d like to prove the shape of the type in actual use (and we’ll also need to find better names for its operations).

rdar://103903565

Instead of just checking the number of breaks in each test case,
expose and check the actual positions of those breaks, too.
This turns _GraphemeBreakingState into a more proper state
machine, although it is only able to recognize breaks in the
forward direction.

The backward direction requires arbitrarily long lookback,
and it currently remains in _StringGuts.
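A minimal sketch of how the state machine is meant to be driven, assuming the empty initializer and the `hasCharacterBoundary(before:)` operation name that come up later in this thread (final names may differ). Note how the scalars can come from any number of sources; here, two separate `String` values are fed as one stream:

```swift
var recognizer = Unicode._CharacterRecognizer()
var offset = 0
// Scalars from two separate String values, fed as one stream.
for piece in ["e\u{301}", "👍🏼!"] {
  for scalar in piece.unicodeScalars {
    // True when an extended grapheme cluster boundary precedes `scalar`
    // (including before the very first scalar).
    if recognizer.hasCharacterBoundary(before: scalar) {
      print("break at scalar offset \(offset)")
    }
    offset += 1
  }
}
// break at scalar offset 0   ("é" starts)
// break at scalar offset 2   ("👍🏼" starts)
// break at scalar offset 4   ("!" starts)
```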
@lorentey requested a review from @Azoy on December 31, 2022 at 00:33
@lorentey (Member, Author):

@swift-ci test

@karwa (Contributor) commented Jan 1, 2023

> I’d like to prove the shape of the type in actual use (and we’ll also need to find better names for its operations).

I suspect that, in addition to detecting breaks, we'll want APIs that accept UTF8 (since that is what people are more likely to have on hand, rather than collections of scalars), and also APIs that vend Characters (I note that there is no direct API currently for creating a Character from scalars or UTF8 - and even if there were, it would need to perform exactly this check internally).

If we combine those, we may be able to cut a fair bit of overhead: consume and validate some UTF8 source, find the grapheme break, and then -- since we'll already have validated UTF8 available from the source -- allocate precisely-sized String storage, copy the code-units into it, and yield the result as a Character.

It's quite an interesting and difficult design problem: there's a matrix of source/destination types, all of them potentially useful to somebody, and there are all kinds of opportunities to remove validation and conversion overheads by doing more in a single operation.

@lorentey (Member, Author) commented Jan 2, 2023

This is supposed to be a core abstraction for Unicode processing that is layered strictly below String. Essentially, it enables code outside of the Standard Library to experiment with advanced constructs such as alternative String types, without the huge headache of having to roll their own grapheme breaking algorithm.

The idea here is that the low-level interface should concentrate entirely on deciding where the boundaries are, without assuming anything about the underlying representation of the data in memory, and with results that are 100% consistent with what String is doing. If a project has a bunch of text data in a series of fixed-size UTF-8 buffers, and they want to copy each grapheme cluster into something silly like an array of Character values, then they can simply use this type to find the boundaries and then trivially create the Characters on their own. If a project just wants to count characters in, say, a single UTF-32 encoded buffer, then it's also trivial to do that, while avoiding the overhead of unnecessary mallocs and copying.
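For instance, counting the characters in a UTF-32 buffer might look like the following sketch (assuming valid UTF-32, and using the operation name that comes up later in this thread):

```swift
let utf32: [UInt32] = [0x61, 0x1F469, 0x200D, 0x1F4BB]  // "a👩‍💻"
var recognizer = Unicode._CharacterRecognizer()
var count = 0
for unit in utf32 {
  // Each UTF-32 code unit is itself a scalar value -- no decoding,
  // no allocation, no copying.
  guard let scalar = Unicode.Scalar(unit) else { break }  // invalid input
  if recognizer.hasCharacterBoundary(before: scalar) {
    count += 1
  }
}
print(count)  // 2 -- "a" and the single cluster "👩‍💻"
```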

Organizing things around functions that return individual Character values is not a recipe for great performance, and I think we'll definitely want to avoid doing anything like that in this context. We need to avoid allocations as much as possible -- copying scalars into a Character value would generally need to allocate backing store for them, which would be very undesirable. Not to mention that using Character would elevate this type to the same abstraction level as String, rendering it largely pointless.

I agree string data is best kept in (piecewise contiguous) UTF-8 encoded buffers. However, UTF-16 will remain unavoidable in the near term, and so I strongly believe that this layer needs to support both encodings, equally. The easiest way to do that is to express the grapheme breaking algorithm in terms of Unicode scalars. This also conveniently matches the way this algorithm is described in Unicode, making it easy to reason about the implementation.

With all that said, it would indeed be really nice to also implement grapheme breaking directly on piecewise contiguous UTF-8 or UTF-16 buffers, as a shortcut for the common case. However, this would be a separate multi-month project, with an unclear payoff: I don't know how much faster I could expect a direct implementation to be. (E.g., does it make sense to express grapheme breaking as a SIMD algorithm? Even if it does, I doubt it would be easy -- even if we figured out what to do about the scalar-keyed lookup tables for Unicode properties, the average grapheme cluster tends to be tiny, so we'd need to allow processing many characters in a single batch, severely complicating the APIs.) Additionally, adding separate, highly optimized grapheme breaking code paths would multiply the recurring maintenance burden of dealing with potential changes in future versions of Unicode -- which is already quite a headache as it is.

Right now I'm happy to keep things simple & stupid -- so I want to just expose the current algorithm as is. The algorithm operates on pairs of Unicode scalars, so that's the obvious type for the input values. Folks with UTF-8 or UTF-16 data can simply decode it into scalars before feeding it to this construct -- exactly like String does.
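A sketch of that decode-then-feed flow, using the stdlib's `transcode(_:from:to:stoppingOnError:into:)` to turn UTF-8 code units into scalars (operation name as elsewhere in this thread):

```swift
var recognizer = Unicode._CharacterRecognizer()
var characterCount = 0

// UTF-8 bytes for "ne" + U+0301 (combining acute): two characters, "né".
let utf8: [UInt8] = [0x6E, 0x65, 0xCC, 0x81]

// Decode the code units into scalars with the stdlib transcoder and
// feed each one to the recognizer -- exactly like String does.
let hadError = transcode(
  utf8.makeIterator(), from: UTF8.self, to: UTF32.self,
  stoppingOnError: true
) { codeUnit in
  if recognizer.hasCharacterBoundary(before: Unicode.Scalar(codeUnit)!) {
    characterCount += 1
  }
}
print(hadError, characterCount)  // false 2
```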

FWIW, I don't think we are anywhere close to exhausting all options to optimize this algorithm -- there is plenty of performance work to be done even before we consider switching away from Unicode scalars.

@karwa (Contributor) commented Jan 2, 2023

> Right now I'm happy to keep things simple & stupid -- so I want to just expose the current algorithm as is. The algorithm operates on pairs of Unicode scalars, so that's the obvious type for the input values. Folks with UTF-8 or UTF-16 data can simply decode it into scalars before feeding it to this construct -- exactly like String does.

Sure. My suggestion was for the future, as you consider what shape this API should have. It's unlikely that people are storing significant amounts of text as scalars, so they'll need to do the decoding manually (which is inconvenient), and if we ever do add fast paths, they won't automatically benefit.

As for Character, the idea is that if you want to pass these grapheme clusters on to another processing stage, which type would you use for it?

  1. An array of scalars. Since your text is unlikely to be stored as scalars, you'll need to allocate an Array to buffer the content you decoded - so this has the same problems WRT allocations as Character, except it's worse because there's no small-array optimisation like we have for string storage (and as you note, most grapheme clusters will be tiny).
  2. Track the indexes from your original code-unit buffer and vend a slice. In this case, you won't be able to use the standard library's UTF8.decode or UTF16.decode methods, since they consume an iterator. There's no way to know which code-units the decoded scalars represent.
    Also, slices of code-units do not have access to other Unicode APIs such as canonical equivalence, character classes, etc. Further processing may be awkward.

So if you want to build composable algorithms based on grapheme clusters, working in terms of Character doesn't seem like a terrible idea. And if we combine it with the idea of allowing UTF8 to be given as the input directly, we could skirt the limitations of the existing UnicodeCodec API and implement idea (2) internally, so Character construction is as low-overhead as possible.

Anyway, I 100% agree that this API which finds breaks in a stream of scalars should be exposed.

@lorentey (Member, Author) commented Jan 2, 2023

> As for Character, the idea is that if you want to pass these grapheme clusters on to another processing stage, which type would you use for it?

This depends on the level of abstraction we're targeting.

If the goal is to solve small, everyday text processing problems, then String and Character are likely to be the right choice, and the Unicode namespace is best ignored altogether.

For the rare problems that aren't suited to these abstractions, representing arbitrary grapheme clusters with self-contained value types is generally a bad idea. Using index ranges into the underlying bulk storage is likely to be the only reasonable way to do this. (In fact, even for small problems, Character is often just a convenient nuisance rather than a truly helpful abstraction -- Substring seems closer to the actual spirit of a grapheme cluster.)
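As a concrete sketch of the index-range idea, over a `String`'s scalar view (with other storage, the boundaries would be offsets into that storage; operation name as elsewhere in this thread):

```swift
let text = "a👍🏼b"
let scalars = text.unicodeScalars
var recognizer = Unicode._CharacterRecognizer()

// Record each boundary as an index into the original storage.
var boundaries: [String.Index] = []
var i = scalars.startIndex
while i < scalars.endIndex {
  if recognizer.hasCharacterBoundary(before: scalars[i]) {
    boundaries.append(i)
  }
  i = scalars.index(after: i)
}
boundaries.append(scalars.endIndex)

// Adjacent boundary pairs delimit grapheme clusters; the slices are
// Substrings sharing the original storage -- nothing is copied.
for (start, end) in zip(boundaries, boundaries.dropFirst()) {
  print(text[start..<end])  // "a", "👍🏼", "b"
}
```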

Getting a position out of an iterator is not a real problem for the target audience of these entry points -- having to define a custom iterator is a mild inconvenience, not a roadblock. However, the standard library's Unicode decoder APIs suffer from an overuse of generics that puts a hard limit on their usefulness in this context. I expect people who'd consume the raw grapheme breaking entry points would generally prefer to roll their own decoders rather than rely on those. (Writing a UTF-8 or UTF-16 decoder is nowhere near as difficult a problem as implementing the grapheme breaking algorithm: unlike the latter, decoders only need a single investment of effort -- they do not need to be kept up to date with yearly Unicode changes.) This is not to say the stdlib shouldn't have better solutions for decoding; but it seems far less of a pressing issue than exposing the grapheme breaking state machine. Creating a decoder interface that efficiently supports all the weird ways people might want to store and access their text data is an extremely non-trivial task.

If we ever add direct fast paths for grapheme breaking UTF-8/16 data, then we can expose new entry points for them at the same time. I doubt this will happen any time soon, so I do not think it would be a good idea to speculatively add such entry points at this time -- especially since the right interface would heavily depend on the actual implementation of those fast paths. (As I said, I expect such fast paths would be all about processing many characters' worth of data at once -- an interface for iterating over grapheme clusters one by one could limit throughput.) However, this is actually one of the things I want to experiment with -- in actual use, I may find a satisfying way to express these.

@Azoy (Contributor) left a comment

I had a few concerns about this, but it seems you addressed them in the comments in your discussion with @karwa. I think this looks good!

@lorentey (Member, Author) commented Jan 3, 2023

I find the constraint that the initializer needs to be supplied with the first Unicode scalar way too tedious in practice -- I think I'll replace it with an empty initializer instead, and have all scalars be fed to hasCharacterBoundary(before:).

This will also have the side effect of the recognizer reporting an unconditional grapheme break before the first scalar. This follows Unicode Standard Annex #29, and -- somewhat surprisingly -- it also seems to be the more convenient choice.
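A minimal sketch of the revised shape just described, assuming the empty initializer and `hasCharacterBoundary(before:)`:

```swift
var recognizer = Unicode._CharacterRecognizer()
// Unconditional break before the first scalar, per UAX #29:
print(recognizer.hasCharacterBoundary(before: "e"))        // true
// U+0301 combines with the preceding "e", so no break here:
print(recognizer.hasCharacterBoundary(before: "\u{301}"))  // false
```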

@lorentey (Member, Author) commented Jan 4, 2023

In the last commit I also added a method for finding grapheme breaks in UTF-8 data. It is helpful, but it also illustrates a common gotcha -- it assumes that the given buffer contains valid UTF-8 data and it produces undefined behavior if it doesn't.

We could consider adding a validating version that would lose the scary "unchecked-unsafe" label, but we may want to keep this one, too, for "high-throughput" grapheme breaking. 🤔
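For illustration, here's what a validating variant might look like, sketched as a hypothetical free function written against the scalar-based entry point (the new UTF-8 method's exact signature isn't quoted in this thread, so nothing below depends on it; operation name as elsewhere in this thread):

```swift
// Hypothetical validating sketch: returns the byte range of the first
// grapheme cluster, or nil if the input is ill-formed UTF-8.
func firstClusterRange<S: Sequence>(ofValidatingUTF8 bytes: S) -> Range<Int>?
where S.Element == UInt8 {
  var recognizer = Unicode._CharacterRecognizer()
  var parser = Unicode.UTF8.ForwardParser()
  var iterator = bytes.makeIterator()
  var offset = 0        // byte offset of the scalar about to be parsed
  var start: Int? = nil // byte offset where the first cluster begins
  while true {
    switch parser.parseScalar(from: &iterator) {
    case .valid(let encoded):
      let scalar = Unicode.UTF8.decode(encoded)
      if recognizer.hasCharacterBoundary(before: scalar) {
        if let s = start { return s..<offset }  // second break: done
        start = offset
      }
      offset += encoded.count
    case .emptyInput:
      if let s = start { return s..<offset }
      return nil
    case .error:
      return nil  // ill-formed input: reject instead of undefined behavior
    }
  }
}
```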

@lorentey (Member, Author) commented Jan 4, 2023

@swift-ci test

@lorentey (Member, Author) commented Jan 4, 2023

@swift-ci test macOS platform

@lorentey (Member, Author) commented Jan 5, 2023

> java.nio.file.FileSystemException: /Users/ec2-user/jenkins/workspace/swift-PR-macos@tmp/durable-677cc71b: No space left on device

@swift-ci smoke test macOS platform

@lorentey merged commit c945561 into swiftlang:main on Jan 5, 2023, and deleted the character-recognizer branch.