-
Notifications
You must be signed in to change notification settings - Fork 64
Make functions to read and write code points public #308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
KDoc preview: |
core/common/src/Utf8.kt
Outdated
* 1 or more non-UTF-8 bytes and return the replacement character (`U+fffd`). This covers encoding | ||
* problems (the input is not properly-encoded UTF-8), characters out of range (beyond the | ||
* problems (the input is not properly encoded UTF-8), characters out of range (beyond the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this source doesn't start with a properly encoded UTF-8 code point, this method will remove
1 or more non-UTF-8 bytes and return the replacement character (U+fffd
). This covers <...> code points for UTF-16 surrogates (U+d800
..U+dfff
)
What this means, exactly? We can decode 1 whole code point that later forms a surrogate pair. Then in what case UTF-16 surrogates are replaced with a replacement char?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this means, exactly?
If a prefix of data contained in a Source
is an ill-formed code unit sequence from the UTF-8 point of view, this method will remove these code units from the source and return the replacement code point.
If there's a well-formed code unit sequence, it will be removed from the Source and decoded. If a decoded value falls out of valid UTF-16 code point value ranges, we'll also return the replacement code point.
I think that rephrasing If this source doesn't start with a properly encoded UTF-8 code point
to If this source starts with an ill-formed UTF-8 code unit sequence
should help with translating the idea. WDYT?
core/common/src/Utf8.kt
Outdated
* | ||
* The replacement character (`U+fffd`) will be also returned if the source starts with a well-formed | ||
* code units sequences, but a decoded value does not pass further validation, such as | ||
* the value is of range (beyond the `0x10ffff` limit of Unicode), maps to UTF-16 surrogates (`U+d800`..`U+dfff`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* the value is of range (beyond the `0x10ffff` limit of Unicode), maps to UTF-16 surrogates (`U+d800`..`U+dfff`), | |
* the value is out of range (beyond the `0x10ffff` limit of Unicode), maps to UTF-16 surrogates (`U+d800`..`U+dfff`), |
It's better now, but maybe my Unicode knowledge still fails me: why a decoded value does not pass further validation
if it maps to UTF-16 surrogates ('U+d800'..'U+dfff')
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why a decoded value does not pass further validation if it maps to UTF-16 surrogates ('U+d800'..'U+dfff')
- Conformance / § 3.9 Unicode Encoding Forms / D92
Because surrogate code points are not Unicode scalar values,
any UTF-8 byte sequence that would otherwise map to code points U+D800..U+DFFF is ill-formed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I guess the meaning is that 'byte sequence should form a whole code point, not a half of it' and 'valid utf-16 is not always a valid utf-8'.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
byte sequence should form a whole code point, not a half of it
That's a funny part:
- Unicode code point is an integer in [0, 0x1FFFF] range (D10, D9 from 3.4 Characters and Encoding)
- Unicode scalar values are all code points except surrogates (D76 from 3.9 Unicode Encoding Forms)
- Surrogates should not be interpreted as abstract characters and are subject to restrictions in interchanges as per 2.4 Code Points and Characters.
But the abovementioned D92 simply prohibits encoding surrogates using UTF-8, despite surrogates being (almost) regular Unicode code points.
Co-authored-by: Leonid Startsev <sandwwraith@users.noreply.github.com>
b4984f8
to
d297e7a
Compare
* which then could be written to a [Sink]. | ||
* Without such a conversion, data written to a [Sink] can not be converted back | ||
* to a string from which a surrogate pair was retrieved. | ||
* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The documentation does not say anything about what will be written if we pass a code point with the surrogate value (U+d800
..U+dfff
).
E.g. on the JVM will be written single byte with value 63 (?
)
Also we should add test on it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, we always writing '?'. Utf8Test::writeSurrogateCodePoint
covers that scenario.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But this is not mentioned in the KDoc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing to that! Opened #314
*/ | ||
@OptIn(DelicateIoApi::class) | ||
internal fun Sink.writeUtf8CodePoint(codePoint: Int): Unit = | ||
public fun Sink.writeCodePointValue(codePoint: Int): Unit = | ||
writeToInternalBuffer { it.commonWriteUtf8CodePoint(codePoint) } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it handle negative values of codePoint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not. Opened #317
Fixes #307