Support binary strings, preserve UTF-8 and UTF-16 errors #2314
Also forgot to mention: I think I've made all the necessary updates to the basic string manipulation operations such as searching, concatenating, splitting and indexing. I've also updated […]. To avoid making this PR too monstrous, I've left some operations like […]. Also, it might or might not be desirable to normalise UTF-8 and UTF-16 errors when concatenating strings.

For the record, the CI failures above are environmental. 3 of the 6 Travis CI builds passed. The other 3 Travis CI builds and the 2 AppVeyor builds failed for reasons unrelated to the PR changes.

Force pushed because I forgot to include replacement character emissions in […].

I've gone ahead and added another commit to do the aforementioned normalisation of UTF-8/UTF-16 errors into well-formed UTF-8 during string concatenation. Also took the opportunity to optimise part of the UTF-8 decoding mechanism, since it's reused in the join mechanism, so decoding is now faster than on master.

Should also point out that my last commit pretty much makes the […]
(force-pushed from 223d6c4 to 8829368)
This is a violation of Unicode.
https://unicode.org/faq/utf_bom.html
https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf (paragraphs D89 and C10)
My implementation does not interpret illegal or ill-formed byte sequences as characters. It interprets both as errors that can be replayed when output.
My implementation does not generate such a sequence. It is able to read such an ill-formed sequence and it can replay it if the output supports it (eg, it's possible to have such a sequence in standard input or on the command line, and that sequence will be maintained when emitted to standard out). Both of these aspects are in accordance with Unicode, which discusses handling of ill-formed strings in other parts of the paragraph you've already referred to (http://www.unicode.org/versions/Unicode14.0.0/ch03.pdf D89):
The paragraph goes on to demonstrate concatenation of ill-formed UTF-16 strings to create a well-formed UTF-16 string [0] (this works in my implementation; it is not possible on master), and it gives an example of an ill-formed UTF-8 string which could not possibly be concatenated to create a well-formed UTF-8 string [1]. These are both considered by the Unicode standard to be possible Unicode strings, just ones that are not well-formed. My implementation handles ill-formed Unicode strings from both encodings.

[0] Here's a demonstration from my branch of the Unicode 16-bit example, concatenating […]

[1] Here's a demonstration from my branch of the Unicode 8-bit example, where the string […]
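As a standalone illustration of the 16-bit case referred to above (plain C, not output from the branch, using two fragments along the lines of the ones in the Unicode text): each fragment is ill-formed, yet their plain concatenation is well-formed, because the code units D800 and DF02 pair up once the strings are joined.

```c
/* Not from the jq codebase: a minimal check that concatenating two
 * ill-formed UTF-16 fragments can yield a well-formed string, because a
 * dangling high surrogate and a leading low surrogate pair up. */
#include <stdio.h>
#include <stdint.h>

static int is_high(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

int main(void) {
    const uint16_t a[] = {0x004D, 0xD800};         /* ill-formed: trailing high surrogate */
    const uint16_t b[] = {0xDF02, 0x004D, 0x0430}; /* ill-formed: leading low surrogate */
    uint16_t s[5];
    size_t n = 0;

    /* Plain code-unit concatenation, no normalisation. */
    for (size_t i = 0; i < 2; i++) s[n++] = a[i];
    for (size_t i = 0; i < 3; i++) s[n++] = b[i];

    for (size_t i = 0; i < n; i++) {
        if (is_high(s[i]) && i + 1 < n && is_low(s[i + 1])) {
            unsigned cp = 0x10000u + ((unsigned)(s[i] - 0xD800) << 10) + (s[i + 1] - 0xDC00);
            printf("U+%04X ", cp);                 /* D800 DF02 pair up as U+10302 */
            i++;
        } else if (is_high(s[i]) || is_low(s[i])) {
            printf("<lone %04X> ", (unsigned)s[i]);
        } else {
            printf("U+%04X ", (unsigned)s[i]);
        }
    }
    printf("\n");  /* prints: U+004D U+10302 U+004D U+0430 -- well-formed throughout */
    return 0;
}
```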
I'm not sure I understand your issue here. Where do you think it behaves differently based on some input encoding? All this is doing is maintaining ill-formed input code units where possible, and replaying them back in the output if possible. Input and output in […]
On Tue., Jan. 11, 2022, 4:35 p.m., Maxdamantus wrote:

> > jq should not behave differently based on the encoding of the input.
>
> I'm not sure I understand your issue here. Where do you think it behaves differently based on some input encoding?

I did not look at the code. You're the one who said it behaved differently in the passage I quoted ("UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters"). It should either maintain the errors for both encodings or convert them into replacement characters for both.
Yes it does cause this to happen. It could be a string with an invalid byte, which you end up passing to […]

It causes jq to generate such sequences. jq doesn't merely replay its input. The whole point of jq is to generate a new document, and this new document may contain these sequences. They may even be modified and duplicated, not just "replayed". Your own explanation specifically says that you convert to these illegal sequences.

Yes, but that isn't relevant. At issue is the production of invalid UTF-16 strings.
Care to explain why? Note that I'm talking specifically about the case where a raw string is being output (eg, using […]). In general it's going to be quite unlikely that a string will have errors of both types, since usually ill-formed strings will have originated in a 16-bit string system (eg, JavaScript, Java, Qt, some Windows API), in which case you'll have UTF-16 errors, or they will have originated in an 8-bit string system (eg, Go, C, some POSIX API, some data from a file), in which case you'll have UTF-8 errors. Here's a contrived example with both sorts of errors: […]
I only "convert" to legal sequences (replacement characters—this is pretty normal behaviour in Unicode handling, and it's described in the Unicode chapter that's already been linked). The point of this changeset is to handle these errors and also to avoid conversion where possible, since such conversion causes unnecessary corruption of binary data.

Invalid UTF-16 strings are only produced from already invalid UTF-16 strings. This can also be extrapolated from the Unicode chapter that's been referred to. They gave an example of concatenating invalid UTF-16 strings to create a valid UTF-16 string, but you can obviously also end up with another invalid UTF-16 string. It is not possible to concatenate valid Unicode strings and end up with an invalid Unicode string. If you're not concatenating strings though, no strings are "produced", only passed through, so there's no visible processing.

Ah, your earlier passage indicated the fix treated UTF-8 and UTF-16 differently, but that wasn't accurate. It sounds like you're saying jq always emits UTF-8, so you can't generate the original malformed UTF-16 substring. Makes sense.
Yes, I know. Your code only generates an invalid output string when given an invalid input string. That's not in contention. The issue is that it causes jq to generate an invalid string under some circumstance. That's a violation of Unicode.
You can, but you're not allowed to. The paragraph specifically says you're allowed to concatenate things that aren't UTF-* to produce something that's valid UTF-*. The point of the paragraph is to say that it's ok for buffers to contain incomplete streams, etc. It doesn't say you're allowed to produce invalid UTF-*. Quite the opposite: it's clearly stated that you must produce valid UTF-*.
How can a buffer (a Unicode string) contain ill-formed Unicode without being produced? Something produced that ill-formed Unicode string. These strings are used as examples in the Unicode chapter.
This seems like a very selective interpretation of the Unicode standard, and it goes against how strings work in typical programming languages. The chapter referred to basically aligns with how strings work in most programming languages [0], which is as a sequence of code units, either 8-bit (aka "bytes") or 16-bit. Such strings are not necessarily purported to be well-formed Unicode. The only point of contention I can find between this changeset and the Unicode chapter is in this paragraph:

Since this changeset is trying to cater to both UTF-8 and UTF-16, it doesn't actually treat strings as sequences of code units, so arguably the "not permissible to mix forms" part does not apply. This at least applies to ill-formed […]

[0] At least ones like C, Go, Java, JavaScript, Python 2.7—interestingly, systems that try to add special "support" for Unicode like Python 3, Haskell and Rust have diverged from this model. Side note: personally, I prefer the 8-bit code units model described here, which has explicitly been adopted by Go, probably not coincidentally, given the overlap between designers of UTF-8 and Go.
The buffer in question doesn't contain Unicode. It contains bytes received. That's the whole point. And here's how it can happen:

`read(fd, buf, 5) // UTF-8 Stream is 61 62 63 64 C3 A9`
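A self-contained version of that scenario (plain C sketch; the pipe merely stands in for whatever descriptor is being read, and the stream contents are the ones given above):

```c
/* A short read leaves the buffer holding an ill-formed UTF-8 prefix:
 * the C3 that starts the two-byte sequence for "é" arrives without its
 * A9 continuation byte. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    const unsigned char stream[] = {0x61, 0x62, 0x63, 0x64, 0xC3, 0xA9}; /* "abcdé" */
    unsigned char buf[5];
    int fds[2];

    if (pipe(fds) != 0)
        return 1;
    (void)write(fds[1], stream, sizeof stream);

    ssize_t n = read(fds[0], buf, 5);       /* the read(fd, buf, 5) from above */
    for (ssize_t i = 0; i < n; i++)
        printf("%02X ", (unsigned)buf[i]);
    printf("\n");                           /* 61 62 63 64 C3 -- not valid UTF-8 */
    return 0;
}
```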
No, it's extremely clear this is to what the paragraph refers.
No, it's extremely common to have buffers that contain partial strings.
It's very common for them to be other things too. C's wide char is 32-bit for me, allowing it to represent any Unicode Code Point as a single char. Perl and Python can also represent any Unicode Code Point as a single char. None of this is relevant. The issue I brought up isn't with what you do internally. There's no problem there. The problem is that your changes cause jq to generate an illegal UTF-8 string.
Okay, so by doing that you have "produced" what Unicode ch03 calls a "Unicode 8-bit string" (paragraphs D80, D81). It is ill-formed due to the lack of a continuation byte. Whether you produced it by doing 1 read of 5 bytes or 5 reads of 1 byte should be irrelevant. If you're still really insistent that concatenation should be considered a special operation that must enforce well-formedness, consider this sequence of operations on your example string:
The first concatenation is of the form […]

Feel free to point out where the Unicode standard says that concatenation or some other operation must produce a well-formed string. I think the standard makes a fairly clear distinction about strings that "purport" to be well-formed. Typical string operations such as concatenation don't normally purport their results to be well-formed under all circumstances.

This sounds supportive of the point I was making. My point was that typical programming languages allow strings that are not valid Unicode (for example allowing partial strings).

If you still think there's a problem here after reading my hopefully clear demonstration above, can you give a precise example of where you think the behaviour of my branch should be different? I think by this point the behaviour should be fairly clear from the initial PR description as well as the supporting examples I've shown through this discussion.
I'll take a look at this. I haven't yet, but my idea for binary support was something like this:
My feeling is that most of the functionality you're suggesting (particularly […]) …

As an optimisation, […]

The […] (EDIT: I think I got […])

(And this PR already updates […])
@Maxdamantus sorry for the delay. I've started to look at this, but in parallel I've been re-thinking how to do binary and have started to implement something not unlike your PR just to experiment. First, my immediate thoughts about this PR:
Having binary be represented as an array of small numbers in jq code is probably a bad idea -- arrays are everywhere in jq, and adding all sorts of branches for binary seems unlikely to be good for performance. String operations, I think, are probably less performance sensitive. As well, I'm a bit disappointed with […]

@Maxdamantus see #2736. What do you think?
@nicowilliams Thanks for looking, better late than never!

> jq should not be the first thing sporting "WTF-8b" support -- is it your invention? are there other uses of it?

As far as I can tell, no one else has used this particular scheme, but it's hard to verify. I've referred to it as "WTF-8b" because it achieves the goals of both WTF-8 (able to encode UTF-16 errors) and UTF-8b (able to encode UTF-8 errors, probably better known as "surrogateescape"). I've encoded UTF-16 errors the same way as in WTF-8, but it's not possible to directly use the UTF-8b/surrogateescape encoding in conjunction with WTF-8.

I think the use case here is a bit niche. I'm not aware of other systems where it would be appropriate to support both UTF-8 and UTF-16 strings at the same time. IMO jq should at least handle JSON produced by UTF-16 systems (eg, JavaScript, where strings such as "\udca9\ud83d" can be emitted). Since jq is commonly used as a unix tool for passing around text, I think it should also handle text in ways that are compatible with other unix tools. Eg, if I run tail -n foo.txt, I expect it to maintain the same bytes that were in the original file—it shouldn't fail due to illegal formatting or replace some of the bytes with replacement characters.

The closest example of something like this combined encoding I can think of is Java's "Modified UTF-8". This is used in compiled .class files for representing string constants. UTF-16 data is encoded mostly[0] the same way as WTF-8, but instead of encoding "\u0000" as a single byte "\x00", it's encoded using the first overlong coding, "\xC0\x80".

I'll also point out WTF-8b is intended as an internal encoding. In the current changeset, it is exposed through @uri, but I think it would be cleaner to transform the UTF-16 errors here into replacement characters. I have however found @uri useful for inspecting the internal representation. Theoretically it should be possible to change the internal representation from WTF-8b to something else (though I'm doubtful that a better encoding exists).

> I'm not sure that we should steal from the over-long UTF-8 sequences to represent `0x80`-`0xff` -- like you I want UTF-16 to die and go away and never come back, and then we could have more than 21 bits of code space for Unicode, so using overlong sequences would not be advisable if I had a magic wand with which to get my way! Maybe we could use private-use codepoints instead? Or the _last_ 128 overlong sequences rather than the first?

I don't think private-use code points would be suitable, as these are valid Unicode, so it would be incorrect to reinterpret them as UTF-8 errors (or to interpret UTF-8 errors as valid Unicode). Eg, here's something I found through GitHub code search, where there are private-use code points that presumably map to characters in a special font: https://github.com/oniani/covid-19-chatbot/blob/46f0afd5341b7f8061779564125d7ca5481c5b10/data_raw/1f4ec41f723e758522faa99829a52f00ea45a9e2.json#L2501. These characters need to be distinguished from UTF-8 errors.

I'm not sure what the advantage would be of using the last 128 overlong sequences rather than the first ones. The last overlong sequences would be 4 bytes long, and there are 65,536 overlong 4-byte sequences (ie, the number of code points below U+10000, the first that is encoded as 4 bytes in UTF-8). Using the first overlong sequences seems quite natural, since they are only 2 bytes long, and there are exactly 128 overlong 2-byte sequences and exactly 128 possible unpaired UTF-8 bytes. It seems to match exactly, so if someone else were to independently create an encoding under the same constraints, they'd likely come up with the same mapping.
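For concreteness, here is a minimal sketch of the kind of mapping being described. It is not code from this branch, and the exact byte assignment used there may differ; it assumes the "natural" pairing of a stray byte B with the overlong two-byte encoding of B - 0x80, plus the usual WTF-8 treatment of a lone surrogate:

```c
/* Sketch of a WTF-8b-style escape layer (an assumption, not the branch's code):
 * - a byte 0x80..0xFF that could not be decoded as UTF-8 is stored as the
 *   overlong two-byte encoding of (byte - 0x80), i.e. one of C0 80..C1 BF;
 * - an unpaired UTF-16 surrogate is stored as its generalized three-byte
 *   UTF-8 form, exactly as in WTF-8. */
#include <assert.h>
#include <stdio.h>
#include <stdint.h>

static void encode_stray_byte(unsigned char b, unsigned char out[2]) {
    assert(b >= 0x80);
    unsigned char k = b - 0x80;          /* 0..0x7F */
    out[0] = 0xC0 | (k >> 6);            /* 0xC0 or 0xC1: always overlong */
    out[1] = 0x80 | (k & 0x3F);
}

static unsigned char decode_stray_byte(const unsigned char in[2]) {
    assert(in[0] == 0xC0 || in[0] == 0xC1);
    return 0x80 + (((in[0] & 0x01) << 6) | (in[1] & 0x3F));
}

static void encode_lone_surrogate(uint16_t u, unsigned char out[3]) {
    assert(u >= 0xD800 && u <= 0xDFFF);
    out[0] = 0xE0 | (u >> 12);           /* always 0xED */
    out[1] = 0x80 | ((u >> 6) & 0x3F);
    out[2] = 0x80 | (u & 0x3F);
}

int main(void) {
    /* All 128 stray bytes round-trip through the 128 overlong 2-byte sequences. */
    for (int b = 0x80; b <= 0xFF; b++) {
        unsigned char e[2];
        encode_stray_byte((unsigned char)b, e);
        assert(decode_stray_byte(e) == b);
    }
    unsigned char e[2], s[3];
    encode_stray_byte(0xFF, e);
    encode_lone_surrogate(0xDCA9, s);
    printf("stray byte FF -> %02X %02X\n", e[0], e[1]);            /* C1 BF */
    printf("lone U+DCA9   -> %02X %02X %02X\n", s[0], s[1], s[2]); /* ED B2 A9 */
    return 0;
}
```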
> I'm very surprised to see changes to `src/builtin.c`:`f_json_parse()`! Why not just add new functions for new options?

I think my intention was to avoid complicating the existing jv_parse_sized function or to avoid creating an extra exposed function (jv_parse_extended_sized?). I believe this was the part of the PR that I felt least comfortable with stylistically, so I'd be happy to revisit this.

> So here's my new idea, which I think we can use to add something like WTF-8 and so on like you're doing, possibly sharing code:

I certainly think further operations on binary strings would be useful (my PR is not focused on providing such functions, but only adding representational support so that the data does not get corrupted or cause errors), but I feel like creating hard distinctions between different types of strings is unnecessary.

I think the main thing you'd want from "binary" strings is byte-based indexing instead of code point-based indexing. jq currently does neither, but I see that your PR (#2736) is working towards code point-based indexing for normal strings, which makes sense (actually, IIUC, it only does iterating at the moment, but I think indexing is a logical next step, so that .[$x] is the same as [.[]][$x]).

Have you thought about making binary strings that are basically just alternative views over normal strings? So $str | .[] would emit code points[1], but $str | asbinary | .[] would emit bytes. $str | asbinary | asstring could just be a noop (it could force the UTF-16 errors into replacement characters, but I don't think it's very useful). If the indexing is meant to return a string (or binary string) instead of an integer, I think it would still be sensible for "\uD800 X"[0] and ("\uD800 X" | asbinary)[0] to return the same string (or binary string) representing that illegal UTF-16 surrogate (this adheres to the principle of preserving data, but in practice I think this is an extreme edge case, since you wouldn't normally perform binary operations on ill-formed UTF-16 data).

I suspect the alternative view method would be simpler implementation-wise, since having different string/binary representations probably means the C code will have to cater for the different representations (eg, @base64 is going to have to convert from normal strings and binary strings .. and what type does @base64d emit? at the moment it emits a normal string, which should iterate through the code points—changing it to return a binary string seems a bit presumptuous).

[0] Actually, it's slightly different in that it treats all surrogates equally as errors, whether they are paired or not, so "💩" is first turned into the UTF-16 surrogates <D83D DCA9> and the surrogates are encoded individually.

[1] Not sure if these should be strings or integers. It looks like your PR makes them strings. I don't think I feel strongly either way, but I have a slight tendency in favour of […]
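To make footnote [0] above concrete, here is a standalone sketch (not code from either PR): WTF-8 combines a paired surrogate into a supplementary code point and emits ordinary 4-byte UTF-8, whereas Java's Modified UTF-8 splits U+1F4A9 into the surrogates <D83D DCA9> and encodes each one on its own, giving six bytes.

```c
/* Standalone illustration of the WTF-8 vs Modified UTF-8 difference for a
 * paired surrogate (U+1F4A9). Not code from jq or from #2736. */
#include <stdio.h>
#include <stdint.h>

/* Ordinary 4-byte UTF-8 (what WTF-8 also produces for a paired surrogate). */
static void utf8_4byte(uint32_t cp, unsigned char out[4]) {
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
}

/* Generalized 3-byte UTF-8 of a single UTF-16 code unit, surrogates included. */
static void utf8_3byte(uint16_t u, unsigned char out[3]) {
    out[0] = 0xE0 | (u >> 12);
    out[1] = 0x80 | ((u >> 6) & 0x3F);
    out[2] = 0x80 | (u & 0x3F);
}

int main(void) {
    uint32_t cp = 0x1F4A9;
    uint16_t hi = 0xD800 + ((cp - 0x10000) >> 10);   /* 0xD83D */
    uint16_t lo = 0xDC00 + ((cp - 0x10000) & 0x3FF); /* 0xDCA9 */

    unsigned char a[4], b[3], c[3];
    utf8_4byte(cp, a);   /* UTF-8/WTF-8: one 4-byte sequence */
    utf8_3byte(hi, b);   /* Modified UTF-8: each surrogate separately ... */
    utf8_3byte(lo, c);

    printf("UTF-8/WTF-8:    %02X %02X %02X %02X\n", a[0], a[1], a[2], a[3]);
    printf("Modified UTF-8: %02X %02X %02X %02X %02X %02X\n",
           b[0], b[1], b[2], c[0], c[1], c[2]);
    /* prints F0 9F 92 A9 vs ED A0 BD ED B2 A9 */
    return 0;
}
```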
(force-pushed from 8829368 to 366f238)
> I don't think private-use code points would be suitable, as these are valid Unicode

Indeed.

> where there are private-use code points that presumably map to characters in a special font

There's a group that assigned Klingon characters to a range of private-use code points.
I've rebased this PR onto the current master without making any significant changes (only resolving a merge conflict with #2633). I intend to push some code improvements later.
> But on the plus side such output would be accepted and preserved by existing software.

I'm aware. Ideally we could get 128 codepoints assigned for this. Yes, you'd have to know to decode from UTF-8 using those to binary, but their presence would be sufficient to indicate that a string is in fact a binary blob.
You would need an external flag in this case to denote that it's a binary blob, since it's possible for binary data to already contain bytes that look like an encoding of these hypothetical code points. For example, imagine that […]. If an external flag is used, there isn't much point in reencoding the binary data, since it could be stored in plain binary form. The downside is that operations have to account for the different string representations (and concatenation between different representations would fail in some cases).

You'd have to know, yes, and the point is that if jq produces it (unless it's a mode where it produces actual binary) then it's UTF-8. The "external flag" here is implied as long as you don't leak this "UTF-8" into contexts where binary is expected without first decoding. We have this problem in spades. Filesystem-related system calls in Unix are all 8-bit clean except for ASCII […]. Anyways, I think for jq all of this is out of scope. WTF-8 might be in scope, but WTF-8b... I think would require socializing more.
(force-pushed from 366f238 to a98f863)
The internal string representation is changed from UTF-8 with replacement characters to a modified form of "WTF-8" that is able to distinctly encode UTF-8 errors and UTF-16 errors. This handles UTF-8 errors in raw string inputs and handles UTF-8 and UTF-16 errors in JSON input. UTF-16 errors (using "\uXXXX") and UTF-8 errors (using the original raw bytes) are maintained when emitting JSON. When emitting raw strings, UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters.
UTF-8 errors and UTF-16 errors that were previously encoded into the ends of strings will now potentially be used to form correct code points. This is mostly a matter of making string equality behave as expected, since without this normalisation, it is possible to produce `jv` strings that are converted to UTF-8 or UTF-16 the same way but are not equal, due to well-formed code units that may or may not be encoded as errors.
Errors are emitted as negative code points instead of being transformed into replacement characters. `implode` is also updated accordingly so the original string can be reconstructed without data loss.
This is no longer needed as strings are capable of storing partial UTF-8 sequences.
(force-pushed from a98f863 to 5c2fe32)
I've made some updates to these changes:
The internal string representation is changed from UTF-8 with replacement characters to a modified form of "WTF-8" that is able to distinctly encode UTF-8 errors and UTF-16 errors.
This handles UTF-8 errors in raw string inputs and handles UTF-8 and UTF-16 errors in JSON input. UTF-16 errors (using "\uXXXX") and UTF-8 errors (using the original raw bytes) are maintained when emitting JSON. When emitting raw strings, UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters.
As well as being usable as a "pretty printer" without risk of corrupting binary data (eg, JSON strings such as "\uDC00\uD800!", or arbitrary bytes in input), it is now possible to pass around arbitrary files either as JSON or raw data (using ill-formed UTF-8):
To demonstrate UTF-16 error preservation:
Fixes at least #1931, #1540, #2259.