Skip to content

Make functions to read and write code points public #308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
May 7, 2024

Conversation

fzhinkin
Copy link
Collaborator

@fzhinkin fzhinkin commented May 3, 2024

Fixes #307

@fzhinkin fzhinkin linked an issue May 3, 2024 that may be closed by this pull request
@fzhinkin fzhinkin marked this pull request as ready for review May 3, 2024 09:12
@fzhinkin
Copy link
Collaborator Author

fzhinkin commented May 3, 2024

KDoc preview:

@fzhinkin fzhinkin requested review from shanshin and sandwwraith May 3, 2024 09:45
* 1 or more non-UTF-8 bytes and return the replacement character (`U+fffd`). This covers encoding
* problems (the input is not properly-encoded UTF-8), characters out of range (beyond the
* problems (the input is not properly encoded UTF-8), characters out of range (beyond the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this source doesn't start with a properly encoded UTF-8 code point, this method will remove
1 or more non-UTF-8 bytes and return the replacement character (U+fffd). This covers <...> code points for UTF-16 surrogates (U+d800..U+dfff)

What this means, exactly? We can decode 1 whole code point that later forms a surrogate pair. Then in what case UTF-16 surrogates are replaced with a replacement char?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this means, exactly?

If a prefix of data contained in a Source is an ill-formed code unit sequence from the UTF-8 point of view, this method will remove these code units from the source and return the replacement code point.

If there's a well-formed code unit sequence, it will be removed from the Source and decoded. If a decoded value falls out of valid UTF-16 code point value ranges, we'll also return the replacement code point.

I think that rephrasing If this source doesn't start with a properly encoded UTF-8 code point to If this source starts with an ill-formed UTF-8 code unit sequence should help with translating the idea. WDYT?

@fzhinkin fzhinkin requested a review from sandwwraith May 6, 2024 14:26
*
* The replacement character (`U+fffd`) will be also returned if the source starts with a well-formed
* code units sequences, but a decoded value does not pass further validation, such as
* the value is of range (beyond the `0x10ffff` limit of Unicode), maps to UTF-16 surrogates (`U+d800`..`U+dfff`),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* the value is of range (beyond the `0x10ffff` limit of Unicode), maps to UTF-16 surrogates (`U+d800`..`U+dfff`),
* the value is out of range (beyond the `0x10ffff` limit of Unicode), maps to UTF-16 surrogates (`U+d800`..`U+dfff`),

It's better now, but maybe my Unicode knowledge still fails me: why a decoded value does not pass further validation if it maps to UTF-16 surrogates ('U+d800'..'U+dfff')?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why a decoded value does not pass further validation if it maps to UTF-16 surrogates ('U+d800'..'U+dfff')

  1. Conformance / § 3.9 Unicode Encoding Forms / D92
Because surrogate code points are not Unicode scalar values, 
any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.

https://www.unicode.org/versions/Unicode15.1.0/ch03.pdf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I guess the meaning is that 'byte sequence should form a whole code point, not a half of it' and 'valid utf-16 is not always a valid utf-8'.

Copy link
Collaborator Author

@fzhinkin fzhinkin May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

byte sequence should form a whole code point, not a half of it

That's a funny part:

  • Unicode code point is an integer in [0, 0x1FFFF] range (D10, D9 from 3.4 Characters and Encoding)
  • Unicode scalar values are all code points except surrogates (D76 from 3.9 Unicode Encoding Forms)
  • Surrogates should not be interpreted as abstract characters and are subject to restrictions in interchanges as per 2.4 Code Points and Characters.

But the abovementioned D92 simply prohibits encoding surrogates using UTF-8, despite surrogates being (almost) regular Unicode code points.

@fzhinkin fzhinkin requested a review from sandwwraith May 7, 2024 09:09
fzhinkin and others added 4 commits May 7, 2024 11:37
@fzhinkin fzhinkin force-pushed the 307-allow-reading-codepoints branch from b4984f8 to d297e7a Compare May 7, 2024 09:42
@fzhinkin fzhinkin merged commit 2d134d7 into develop May 7, 2024
@fzhinkin fzhinkin deleted the 307-allow-reading-codepoints branch May 7, 2024 09:43
* which then could be written to a [Sink].
* Without such a conversion, data written to a [Sink] can not be converted back
* to a string from which a surrogate pair was retrieved.
*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The documentation does not say anything about what will be written if we pass a code point with the surrogate value (U+d800..U+dfff).
E.g. on the JVM will be written single byte with value 63 (?)

Also we should add test on it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, we always writing '?'. Utf8Test::writeSurrogateCodePoint covers that scenario.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is not mentioned in the KDoc

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing to that! Opened #314

*/
@OptIn(DelicateIoApi::class)
internal fun Sink.writeUtf8CodePoint(codePoint: Int): Unit =
public fun Sink.writeCodePointValue(codePoint: Int): Unit =
writeToInternalBuffer { it.commonWriteUtf8CodePoint(codePoint) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it handle negative values of codePoint?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not. Opened #317

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make functions to read and write code points public
4 participants