
Support binary strings, preserve UTF-8 and UTF-16 errors #2314

Open · wants to merge 8 commits into stedolan:master from Maxdamantus:210520-wtf8b
Conversation

@Maxdamantus commented May 20, 2021

The internal string representation is changed from UTF-8 with replacement characters to a modified form of "WTF-8" that is able to distinctly encode UTF-8 errors and UTF-16 errors.

This handles UTF-8 errors in raw string inputs and handles UTF-8 and UTF-16 errors in JSON input. UTF-16 errors (using "\uXXXX") and UTF-8 errors (using the original raw bytes) are maintained when emitting JSON. When emitting raw strings, UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters.

As well as allowing jq to be used as a "pretty printer" without risk of corrupting binary data (eg, JSON strings such as "\uDC00\uD800!", or arbitrary bytes in input), this makes it possible to pass around arbitrary files either as JSON or raw data (using ill-formed UTF-8):

$ sha1sum jq
ff8e6e9fd7d87eb1c9179da546ccbbcd77a40f14  jq
$ ./jq -n '$a' --rawfile a jq | ./jq -j . | sha1sum
ff8e6e9fd7d87eb1c9179da546ccbbcd77a40f14  -
$ base64 jq | ./jq -jR '@base64d' | sha1sum
ff8e6e9fd7d87eb1c9179da546ccbbcd77a40f14  -

To demonstrate UTF-16 error preservation:

$ node -e 'console.log(JSON.stringify("💩".split("")));'
["\ud83d","\udca9"]
$ node -e 'console.log(JSON.stringify("💩".split("")));' | ./jq .
[
  "\ud83d",
  "\udca9"
]

Fixes at least #1931, #1540, #2259.

@Maxdamantus (Author) commented May 20, 2021

Also forgot to mention, I think I've made all the necessary updates to the basic string manipulation operations such as searching, concatenating, splitting, indexing. I've also updated @base64 and fromjson as those are relatively likely to be used with binary data.

To avoid making this PR too monstrous, I've left some operations like @html and @uri for now, so it is possible for these to expose the internal representation in the presence of ill-formed Unicode.

Also, it might or might not be desirable to normalise UTF-8 and UTF-16 errors when concatenating strings. This is also not yet implemented (EDIT: now it is). (eg, "\uD83D" + "\uDCA9" would currently produce a string that is distinct from "\uD83D\uDCA9" or "💩", though this normalisation can be done using the tojson | fromjson filter).
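For reference, the UTF-16 half of that normalisation comes down to standard surrogate-pair arithmetic. A minimal C sketch of the combining step (illustrative only, not the PR's actual code):

#include <stdint.h>

/* Combine a UTF-16 surrogate pair into the code point it represents.
   hi must be in [0xD800, 0xDBFF] and lo in [0xDC00, 0xDFFF]. */
static uint32_t combine_surrogates(uint16_t hi, uint16_t lo) {
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}

combine_surrogates(0xD83D, 0xDCA9) yields 0x1F4A9, which is why "\uD83D" + "\uDCA9" can normalise to the same string as "💩".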

@coveralls commented May 20, 2021

Coverage increased (+1.07%) to 85.207% when pulling 223d6c4 on Maxdamantus:210520-wtf8b into d18b2d0 on stedolan:master.

@Maxdamantus (Author) commented May 20, 2021

For the record, the CI failures above are environmental. 3 of the 6 Travis CI builds passed. The other 3 Travis CI builds and the 2 AppVeyor builds failed for reasons unrelated to the PR changes.

@Maxdamantus (Author)

Force pushed because I forgot to include replacement character emissions in @base64 and utf8bytelength. Also removed an unnecessary static function in jv_parse.c.

@Maxdamantus (Author)

I've gone ahead and added another commit to do the aforementioned normalisation of UTF-8/UTF-16 errors into well-formed UTF-8 during string concatenation. Also took the opportunity to optimise part of the UTF-8 decoding mechanism, since it's reused in the join mechanism, so decoding is now faster than on master.

@Maxdamantus (Author)

Should also point out that my last commit pretty much makes the jvp_utf8_backtrack mechanism obsolete, since the invalid bytes are stored and corrected on concatenation. This means that #2259 is now incidentally fixed, since that bug arises from a missing application of jvp_utf8_backtrack.

@ikegami commented Jan 11, 2022

This is a violation of Unicode.

A conformant process must not interpret illegal or ill-formed byte sequences as characters

A sequence such as 110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated.

https://unicode.org/faq/utf_bom.html

https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf (Paragraphs D89 and C10)

@ikegami commented Jan 11, 2022

UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters.

jq should not behave differently based on the encoding of the input.

@Maxdamantus (Author) commented Jan 11, 2022

A conformant process must not interpret illegal or ill-formed byte sequences as characters

My implementation does not interpret illegal or ill-formed byte sequences as characters. It interprets both as errors that can be replayed when output.

A sequence such as 110xxxxx₂ 0xxxxxxx₂ is ill-formed and must never be generated.

My implementation does not generate such a sequence. It is able to read such an ill-formed sequence and it can replay it if the output supports it (eg, it's possible to have such a sequence in standard input or on the command line, and that sequence will be maintained when emitted to standard out). Both of these aspects are in accordance with Unicode, which discusses handling of ill-formed strings in other parts of the paragraph you've already referred to (http://www.unicode.org/versions/Unicode14.0.0/ch03.pdf D89):

Unicode strings need not contain well-formed code unit sequences under all conditions. This is equivalent to saying that a particular Unicode string need not be in a Unicode encoding form.

The paragraph goes on to demonstrate concatenation of ill-formed UTF-16 strings to create a well-formed UTF-16 string [0] (this works in my implementation, not possible on master), and it gives an example of an ill-formed UTF-8 string which could not possibly be concatenated to create a well-formed UTF-8 string [1]. These are both considered by the Unicode standard to be possible Unicode strings, just ones that are not well-formed. My implementation handles ill-formed Unicode strings from both encodings.


[0] Here's a demonstration from my branch of the Unicode 16-bit example, concatenating <004D D800> and <DF02 004D>:

$ jq -n '"\u004D\uD800" + "\uDF02\u004D"'
"M𐌂M"

[1] Here's a demonstration from my branch of the Unicode 8-bit example, where the string <C0 80 61 F3> is preserved from input to output (first two bytes are an invalid "overlong" representation, and the last byte is a leading byte with no trailing bytes):

$ echo 'C0 80 61 F3' | xxd -r -p | jq -sR . | jq -j . | xxd
00000000: c080 61f3                                ..a.

@Maxdamantus (Author)

jq should not behave differently based on the encoding of the input.

I'm not sure I understand your issue here. Where do you think it behaves differently based on some input encoding? All this is doing is maintaining ill-formed input code units where possible, and replaying them back in the output if possible. Input and output in jq is all basically just UTF-8 (or ASCII if using the -a option). UTF-16 handling only happens based on the possibility of representing UTF-16 code units in JSON.

@ikegami commented Jan 12, 2022 via email

I did not look at the code. You're the one who said it behaved differently in the passage I quoted ("UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters"). It should either maintain the errors for both encodings, or convert them into replacement characters for both.

@ikegami commented Jan 12, 2022

My implementation does not interpret illegal or ill-formed byte sequences as characters.

Yes, it does cause this to happen. It could be a string with an invalid byte, which you end up passing to match, for example.

My implementation does not generate such a sequence.

It causes jq to generate such sequences. jq doesn't merely replay its input. The whole point of jq is to generate a new document, and this new document may contain these sequences. They may even be modified and duplicated, not just "replayed".

Your own explanation specifically says that you convert to these illegal sequences.

@ikegami commented Jan 12, 2022

The paragraph goes on to demonstrate concatenation of ill-formed UTF-16 strings to create a well-formed UTF-16 string

Yes, but that isn't relevant. At issue is the production of invalid UTF-16 strings.

@Maxdamantus (Author) commented Jan 12, 2022

I did not look at the code. You're the one who said it behaved differently in the passage I quoted ("UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters"). It should either maintain the errors for both encodings, or convert them into replacement characters for both.

Care to explain why? Note that I'm talking specifically about the case where a raw string is being output (eg, using jq -r). In this scenario it is possible to maintain UTF-8 errors, but not possible to maintain UTF-16 errors, since when emitting raw strings, any 16-bit string data needs to be interpreted as UTF-16 and converted to UTF-8.

In general it's going to be quite unlikely that a string will have errors of both types, since usually ill-formed strings will have originated in a 16-bit string system (eg, JavaScript, Java, Qt, some Windows API) in which case you'll have UTF-16 errors or they will have originated in an 8-bit string system (eg, Go, C, some POSIX API, some data from a file) in which case you'll have UTF-8 errors.

Here's a contrived example with both sorts of errors:

# sample input
$ (echo -n '"_'; echo 'C0 80' | xxd -r -p; echo '_\ud800_"') | xxd
00000000: 225f c080 5f5c 7564 3830 305f 220a       "_.._\ud800_".
# pass through `jq`, emitting JSON, so UTF-16 errors can be represented in output
$ (echo -n '"_'; echo 'C0 80' | xxd -r -p; echo '_\ud800_"') | jq . | xxd
00000000: 225f c080 5f5c 7564 3830 305f 220a       "_.._\ud800_".
# pass through `jq`, emitting raw string, so UTF-16 errors can not be represented, replaced with `<EF BF BD>`
$ (echo -n '"_'; echo 'C0 80' | xxd -r -p; echo '_\ud800_"') | jq -r . | xxd
00000000: 5fc0 805f efbf bd5f 0a                   _.._..._.

@Maxdamantus (Author)

Yes it does cause this to happen. It could be a string with an invalid byte, which you end up passing to match, for example.

match doesn't currently see the special error representations. For ill-formed UTF-16 it will only see replacement characters. I'd need to look through the code to figure out if it does anything special with UTF-8 errors, but my changes don't affect whatever behaviour that has. If match is able to handle either kind of error, the errors will not appear as regular characters and will only be matchable by regexes that specifically refer to those errors.

Your own explanation specifically says that you convert to these illegal sequences.

I only "convert" to legal sequences (replacement characters—this is pretty normal behaviour in Unicode handling, and it's described in the Unicode chapter that's already been linked). The point of this changeset is to handle these errors and also to avoid conversion where possible, since such conversion causes unnecessary corruption of binary data.

@Maxdamantus (Author) commented Jan 12, 2022

Yes, but that isn't relevant. At issue is the production of invalid UTF-16 strings.

Invalid UTF-16 strings are only produced from already invalid UTF-16 strings. This can also be extrapolated from the Unicode chapter that's been referred to. They gave an example of concatenating invalid UTF-16 strings to create a valid UTF-16 string, but you can obviously also end up with another invalid UTF-16 string. It is not possible to concatenate valid Unicode strings and end up with an invalid Unicode string. If you're not concatenating strings though, no strings are "produced", only passed through, so there's no visible processing.

@ikegami commented Jan 12, 2022

I did not look at the code. You're the one who said it behaved differently in the passage I quoted ("UTF-8 errors are maintained and UTF-16 errors are converted into replacement characters"). It should either maintain the errors for both encodings, or convert them into replacement characters for both.

Care to explain why?

Ah, your earlier passage indicated the fix treated UTF-8 and UTF-16 differently, but that wasn't accurate.

It sounds like you're saying jq always emits UTF-8, so you can't generate the original malformed UTF-16 substring. Makes sense.

@ikegami
Copy link

ikegami commented Jan 12, 2022

Invalid UTF-16 strings are only produced from already invalid UTF-16 strings

Yes, I know. Your code only generates an invalid output string when given an invalid input string. That's not in contention.

The issue is that it causes jq to generate an invalid string under some circumstances. That's a violation of Unicode.

They gave an example of concatenating invalid UTF-16 strings to create a valid UTF-16 string, but you can obviously also end up with another invalid UTF-16 string

You can, but you're not allowed to.

The paragraph specifically says you're allowed to concatenate things that aren't UTF-* to produce something that's valid UTF-*. The point of the paragraph is to say that it's ok for buffers to contain incomplete streams, etc.

It doesn't say you're allowed to produce invalid UTF-*. Quite the opposite: it's clearly stated that you must produce valid UTF-*.

@Maxdamantus (Author)

The paragraph specifically says you're allowed to concatenate things that aren't UTF-* to produce something that's valid UTF-*. The point of the paragraph is to say that it's ok for buffers to contain incomplete streams, etc.

How can a buffer (a Unicode string) contain ill-formed Unicode without being produced? Something produced that ill-formed Unicode string. These strings are used as examples in the Unicode chapter.

It doesn't say you're allowed you produce invalid UTF-*. Quite the opposite, it's clearly stated that you must produce valid UTF-*.

This seems like a very selective interpretation of the Unicode standard, and it goes against how strings work in typical programming languages. The chapter referred to basically aligns with how strings work in most programming languages [0], which is as a sequence of code units, either 8-bit (aka "bytes") or 16-bit. Such strings are not necessarily purported to be well-formed Unicode.

The only point of contention I can find between this changeset and the Unicode chapter is in this paragraph:

D80 Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.
• In the rawest form, Unicode strings may be implemented simply as arrays of the appropriate integral data type, consisting of a sequence of code units lined up one immediately after the other.
• A single Unicode string must contain only code units from a single Unicode encoding form. It is not permissible to mix forms within a string.

Since this changeset is trying to cater to both UTF-8 and UTF-16, it doesn't actually treat strings as sequences of code units, so arguably the "not permissible to mix forms" part does not apply. This at least applies to ill-formed jq strings—well-formed jq strings end up being represented internally as well-formed UTF-8 code units, though jq does not currently provide access to code units, only code points. A jq string in this changeset could be seen as consisting of a sequence of "Unicode strings" that are either well-formed (opaque Unicode code points) or erroneous UTF-8 code units or erroneous UTF-16 code units, and it's not possible for the UTF-8 errors and UTF-16 errors to be confused.


[0] At least ones like C, Go, Java, JavaScript, Python 2.7—interestingly, systems that try to add special "support" for Unicode like Python 3, Haskell and Rust have diverged from this model. Side note: personally, I prefer the 8-bit code units model described here, which has explicitly been adopted by Go, probably not coincidentally, given the overlap between designers of UTF-8 and Go.

@ikegami commented Jan 13, 2022

How can a buffer (a Unicode string) contain ill-formed Unicode without being produced?

The buffer in question doesn't contain Unicode. It contains bytes received. That's the whole point.

And here's how it can happen:

read(fd, buf, 5)   // UTF-8 Stream is 61 62 63 64 C3 A9

This seems like a very selective interpretation of the Unicode standard

No, it's extremely clear that this is what the paragraph refers to.

and it goes against how strings work in typical programming languages.

No, it's extremely common to have buffers that contain partial strings.

The chapter referred to basically aligns with how strings work in most programming languages [0], which is as a sequence of code units, either 8-bit (aka "bytes") or 16-bit. Such strings are not necessarily purported to be well-formed Unicode.

It's very common for them to be other things too. C's wide char is 32-bit for me, allowing it to represent any Unicode Code Point as a single char. Perl and Python can also represent any Unicode Code Point as a single char.

None of this is relevant. The issue I brought up isn't with what you do internally. There's no problem there. The problem is that your changes cause jq to generate an illegal UTF-8 string. jq should not be generating illegal UTF-8.

@Maxdamantus (Author) commented Jan 14, 2022

The buffer in question doesn't contain Unicode. It contains bytes received. That's the whole point.

And here's how it can happen:

read(fd, buf, 5)   // UTF-8 Stream is 61 62 63 64 C3 A9

Okay, so by doing that you have "produced" what Unicode ch03 calls a "Unicode 8-bit string" (paragraphs D80, D81). It is ill-formed due to the lack of a continuation byte. Whether you produced it by doing 1 read of 5 bytes or 5 reads of 1 byte should be irrelevant.

If you're still really insistent that concatenation should be considered a special operation that must enforce well-formedness, consider this sequence of operations on your example string:

// NOTE: illegal UTF-8 bytes in the quoted string notation have been replaced with U+FFFD code points
s = <61 62 63 64 C3> // "abcd�"
s += <A9 61 63 C3> // "�ac�"
// s = <61 62 63 64 C3 A9 61 63 C3> ("abcdéac�")
s += <A0> // ("�")
// s = <61 62 63 64 C3 A9 61 63 C3 A0> ("abcdéacà")

The first concatenation is of the form invalid + invalid → invalid and the second concatenation is of the form invalid + invalid → valid. Any sane system that allows representation of these invalid Unicode strings will work in exactly the way described above. If a system decides to do a conversion on the invalid string produced through the first concatenation, the final result will almost certainly not be what is expected (probably "abcdéac��"). I'm not aware of any programming language that works the latter way. If there is one, that should really be considered a bug.

No, it's extremely clear this is to what the paragraph refers.

Feel free to point out where the Unicode standard says that concatenation or some other operation must produce a well-formed string. I think the standard fairly clearly distinguishes strings that "purport" to be well-formed. Typical string operations such as concatenation don't normally purport their results to be well-formed under all circumstances.

and it goes against how strings work in typical programming languages.

No, it's extremely common to have buffers that contain partial strings.

.. This sounds supportive of the point I was making. My point was that typical programming languages allow strings that are not valid Unicode (for example allowing partial strings).

jq should not be generating illegal UTF-8.

If you still think there's a problem here after reading my hopefully clear demonstration above, can you give a precise example of where you think the behaviour of my branch should be different? I think by this point the behaviour should be fairly clear from the initial PR description as well as the supporting examples I've shown through this discussion.

@nicowilliams (Contributor)

I'll take a look at this. I haven't yet, but my idea for binary support was something like this:

  • have a binary "type" (not a first-class type) that is represented as an array of integers 0-255 in jq code, but internally is just a byte buffer
  • have frombinary/tobinary functions to convert UTF-8 strings from/to binary
  • tobinary should also take non-binary arrays of integers 0-255
  • make sure the base64 code will accept / output binary
  • have a --binary-input CLI option that reads the whole file in as binary
  • have a --binary-output CLI option that causes any binary outputs of the jq program to be emitted w/o conversion or interpretation

@Maxdamantus (Author) commented May 26, 2022

my idea for binary support was something like this:

My feeling is that most of the functionality you're suggesting (particularly, frombinary, tobinary) would still be useful on top of this PR. These functions can be implemented easily on top of this PR, since it supports storing binary data in string values (eg, [0, 1, 2, 255] | tobinary would emit a string).

As an optimisation, "foo" | frombinary could emit the binary type you're referring to instead of a regular array of numbers.

The --binary-input and --binary-output options could also still be useful, though they would be equivalent to using jq -sR 'frombinary | ...' and jq -j '... | tobinary' respectively.

(EDIT: I think I got frombinary/tobinary backwards, though this might indicate that it would be better to call them tobytes/frombytes or explodebytes/implodebytes respectively)

(And this PR already updates @base64 and @base64d to handle binary data)

@nicowilliams (Contributor)

@Maxdamantus sorry for the delay. I've started to look at this, but in parallel I've been re-thinking how to do binary and have started to implement something not unlike your PR just to experiment.

First, my immediate thoughts about this PR:

  • jq should not be the first thing sporting "WTF-8b" support -- is it your invention? are there other uses of it?
  • I'm not sure that we should steal from the over-long UTF-8 sequences to represent 0x80-0xff -- like you I want UTF-16 to die and go away and never come back, and then we could have more than 21 bits of code space for Unicode, so using overlong sequences would not be advisable if I had a magic wand with which to get my way!
    Maybe we could use private-use codepoints instead? Or the last 128 overlong sequences rather than the first?
  • I'm very surprised to see changes to src/builtin.c:f_json_parse()!
    Why not just add new functions for new options?

Having binary be represented as an array of small numbers in jq code is probably a bad idea -- arrays are everywhere in jq, and adding all sorts of branches for binary seems unlikely to be good for performance. String operations, I think, are probably less performance sensitive. As well, I'm a bit disappointed with explode not producing a stream. So here's my new idea, which I think we can use to add something like WTF-8 and so on like you're doing, possibly sharing code:

  • represent binary as a sub-kind of string
    • rename the pad_ field of jv to subkind, and add an enum type for its values with a JV_SUBKIND_BINARY (see the sketch after this list)
    • a jv that is of kind JV_KIND_STRING with the sub-kind JV_SUBKIND_BINARY would be a binary string value
    • add jv_binary_sized() and related
    • make all the jv functions like jv_string_concat(), jv_string_slice(), jv_string_append*() that deal with strings support binary strings (meaning: don't check for UTF-8 validity and don't replace bad codepoints)
    • adding a string and a binary string (or a binary string and a string) should yield a binary string
  • maybe maybe maybe let .[] be the operation that explodes a string into its codepoints as a stream
    • then let .[] on a binary string output the string's byte values, not codepoints
  • add tobinary for "converting" strings to binary, and make tostring on binary strings do bad codepoint replacement as needed
  • add a subtype function?
  • add isbinary to indicate whether a string is binary
  • add isutf8 to indicate whether a string (possibly binary) is valid UTF-8
  • make input_filename output binary when the filename is not valid UTF-8
  • JSON output should by default do bad codepoint replacements as needed on binary values then output them as strings
    • if you want to have binary base64-encoded, use the base64 encoder
    • but there could be an output mode that outputs binary values as if they were valid UTF-8 strings
  • add command-line options for new raw input modes:
    • newline-terminated binary (so, text, but not necessarily UTF-8)
    • binary with some record size
    • binary slurp (read the whole file as one big binary string)
  • add command-line output modes:
    • raw binary output (no codepoint replacements on top-level strings)
    • raw binary output with a separator byte (ditto; see discussions on -0)
    • WTF-8 output (so, JSON, but any strings which are "valid WTF-8" would get treated as if they were valid UTF-8)
    • binary output (JSON with any binary strings treated as if they were valid UTF-8)
    • maybe a WTF-8b output mode?
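For concreteness, here is a rough C sketch of the sub-kind layout from the first bullet group above, based on the jv struct in jq's src/jv.h (the enum and the renamed field are hypothetical; a real change would also touch every place that currently reads pad_):

struct jv_refcnt;

typedef enum {
  JV_SUBKIND_NONE = 0,
  JV_SUBKIND_BINARY = 1   /* a JV_KIND_STRING whose bytes are opaque binary */
} jv_subkind;

typedef struct {
  unsigned char kind_flags;
  unsigned char subkind;   /* was: unsigned char pad_ */
  unsigned short offset;
  int size;
  union {
    struct jv_refcnt* ptr;
    double number;
  } u;
} jv;

A jv_binary_sized() constructor would then set kind_flags to the string kind and subkind to JV_SUBKIND_BINARY, leaving the existing string functions to branch on the sub-kind only where UTF-8 validity checks happen.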

@nicowilliams (Contributor)

@Maxdamantus see #2736. What do you think?

@Maxdamantus (Author) commented Jul 20, 2023

@nicowilliams Thanks for looking, better late than never!

* jq should not be the first thing sporting "WTF-8b" support -- is it your invention?  are there other uses of it?

As far as I can tell, no one else has used this particular scheme, but it's hard to verify. I've referred to it as "WTF-8b" because it achieves the goals of both WTF-8 (able to encode UTF-16 errors) and UTF-8b (able to encode UTF-8 errors, probably better known as "surrogateescape"). I've encoded UTF-16 errors the same way as in WTF-8, but it's not possible to directly use the UTF-8b/surrogateescape encoding in conjunction with WTF-8.

I think the use case here is a bit niche. I'm not aware of other systems where it would be appropriate to support both UTF-8 and UTF-16 strings at the same time.

IMO jq should at least handle JSON produced by UTF-16 systems (eg, JavaScript, where strings such as "\udca9\ud83d" can be emitted).

Since jq is commonly used as a unix tool for passing around text, I think it should also handle text in ways that are compatible with other unix tools. Eg, if I run tail -n 20 foo.txt, I expect it to maintain the same bytes that were in the original file—it shouldn't fail due to illegal formatting or replace some of the bytes with replacement characters.

The closest example of something like this combined encoding I can think of is Java's "Modified UTF-8". This is used in compiled .class files for representing string constants. UTF-16 data is encoded mostly[0] the same way as WTF-8, but instead of encoding "\u0000" as a single byte "\x00", it's encoded using the first overlong coding, "\xC0\x80".

I'll also point out WTF-8b is intended as an internal encoding. In the current changeset, it is exposed through @uri, but I think it would be cleaner to transform the UTF-16 errors here into replacement characters. I have however found @uri useful for inspecting the internal representation. Theoretically it should be possible to change the internal representation from WTF-8b to something else (though I'm doubtful that a better encoding exists).

* I'm not sure that we should steal from the over-long UTF-8 sequences to represent `0x80`-`0xff` -- like you I want UTF-16 to die and go away and never come back, and then we could have more than 21 bits of code space for Unicode, so using overlong sequences would not be advisable if I had a magic wand with which to get my way!
  Maybe we could use private-use codepoints instead?  Or the _last_ 128 overlong sequences rather than the first?

I don't think private-use code points would be suitable, as these are valid Unicode, so it would be incorrect to reinterpret them as UTF-8 errors (or to interpret UTF-8 errors as valid Unicode). Eg, here's something I found through GitHub code search, where there are private-use code points that presumably map to characters in a special font. These characters need to be distinguished from UTF-8 errors.

I'm not sure what the advantage would be of using the last 128 overlong sequences rather than the first ones. The last overlong sequences would be 4 bytes long, and there are 65,536 overlong 4-byte sequences (ie, the number of code points below U+10000, the first that is encoded as 4 bytes in UTF-8).

Using the first overlong sequences seems quite natural, since they are only 2 bytes long, and there are exactly 128 overlong 2-byte sequences, and exactly 128 possible unpaired UTF-8 bytes. It seems to match exactly, so if someone else were to independently create an encoding under the same constraints, they'd likely come up with the same mapping.
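To make the correspondence concrete, here is a small C sketch of the mapping as I understand it (illustrative, not the PR's actual code). An unpairable byte b in 0x80..0xFF is stored as the overlong two-byte sequence that would encode b - 0x80; since well-formed UTF-8 never contains the bytes 0xC0 or 0xC1, the stored form cannot be confused with real input:

#include <stdint.h>

/* Encode an invalid/unpaired byte (0x80..0xFF) as one of the first 128
   overlong two-byte sequences, <C0 80>..<C1 BF>. */
static void wtf8b_encode_error_byte(uint8_t b, uint8_t out[2]) {
    uint8_t v = b - 0x80;        /* 0x00..0x7F */
    out[0] = 0xC0 | (v >> 6);    /* 0xC0 or 0xC1 */
    out[1] = 0x80 | (v & 0x3F);
}

/* Recover the original byte when emitting raw output. */
static uint8_t wtf8b_decode_error_byte(const uint8_t in[2]) {
    return 0x80 + (uint8_t)(((in[0] & 0x01) << 6) | (in[1] & 0x3F));
}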

* I'm very surprised to see changes to `src/builtin.c`:`f_json_parse()`!
  Why not just add new functions for new options?

I think my intention was to avoid complicating the existing jv_parse_sized function or to avoid creating an extra exposed function (jv_parse_extended_sized?). I believe this was the part of the PR that I felt least comfortable with stylistically, so I'd be happy to revisit this.

So here's my new idea, which I think we can use to add something like WTF-8 and so on like you're doing, possibly sharing code:

I certainly think further operations on binary strings would be useful (my PR is not focused on providing such functions, but only adding representational support so that the data does not get corrupted or cause errors), but I feel like creating hard distinctions between different types of strings is unnecessary.

I think the main thing you'd want from "binary" strings is byte-based indexing instead of code point-based indexing.

jq currently does neither, but I see that your PR (#2736) is working towards code point-based indexing for normal strings, which makes sense (actually, IIUC, it only does iterating at the moment, but I think indexing is a logical next step, so that .[$x] is the same as [.[]][$x]).

Have you thought about making binary strings that are basically just alternative views over normal strings? So $str | .[] would emit code points[1], but $str | asbinary | .[] would emit bytes. $str | asbinary | asstring could just be a noop (it could force the UTF-16 errors into replacement characters, but I don't think it's very useful). If the indexing is meant to return a string (or binary string) instead of an integer, I think it would still be sensible for "\uD800 X"[0] and ("\uD800 X" | asbinary)[0] to return the same string (or binary string) representing that illegal UTF-16 surrogate (this adheres to the principle of preserving data, but in practice I think this is an extreme edge case, since you wouldn't normally perform binary operations on ill-formed UTF-16 data).

I suspect the alternative view method would be simpler implementation-wise, since having different string/binary representations probably means the C code will have to cater for the different representations (eg, @base64 is going to have to convert from normal strings and binary strings .. and what type does @base64d emit? at the moment it emits a normal string, which should iterate through the code points—changing it to return a binary string seems a bit presumptuous).

[0] Actually, it's slightly different in that it treats all surrogates equally as errors, whether they are paired or not, so "💩" is first turned into the UTF-16 surrogates <D83D DCA9> and the surrogates are encoded individually.
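To make footnote [0] concrete, here is a sketch of the CESU-8-style arithmetic that Java's "Modified UTF-8" applies to supplementary code points (standard surrogate math, not code from this PR):

#include <stdint.h>

/* Split a supplementary code point (0x10000..0x10FFFF) into a UTF-16
   surrogate pair and encode each half as its own 3-byte sequence. */
static void cesu8_encode_supplementary(uint32_t cp, uint8_t out[6]) {
    uint16_t halves[2];
    halves[0] = 0xD800 + (uint16_t)((cp - 0x10000) >> 10);   /* high surrogate */
    halves[1] = 0xDC00 + (uint16_t)((cp - 0x10000) & 0x3FF); /* low surrogate */
    for (int i = 0; i < 2; i++) {
        uint16_t s = halves[i];
        out[i*3 + 0] = 0xE0 | (s >> 12);
        out[i*3 + 1] = 0x80 | ((s >> 6) & 0x3F);
        out[i*3 + 2] = 0x80 | (s & 0x3F);
    }
}

So U+1F4A9 ("💩") becomes <ED A0 BD ED B2 A9> rather than the well-formed UTF-8 <F0 9F 92 A9>.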

[1] Not sure if these should be strings or integers. It looks like your PR makes them strings. I don't think I feel strongly either way, but I have a slight tendency in favour of ("💩" | asbinary | .[0]) == 240, so for consistency I might expect ("💩" | .[0]) == 128169. I think in both cases, .[0:(. | length)] == . is reasonable (a slice will always produce the same type as the input).

@Maxdamantus Maxdamantus deleted the 210520-wtf8b branch July 21, 2023 22:06
@Maxdamantus Maxdamantus restored the 210520-wtf8b branch July 21, 2023 22:07
@Maxdamantus Maxdamantus reopened this Jul 21, 2023
@ikegami commented Jul 21, 2023 via email

@Maxdamantus (Author)

I've rebased this PR onto the current master without making any significant changes (only resolving a merge conflict with #2633). I intend to push some code improvements later.

@nicowilliams (Contributor)

I don't think private-use code points would be suitable, as these are valid Unicode

Indeed. But on the plus side such output would be accepted and preserved by existing software.

where there are private-use code points that presumably map to characters in a special font

There's a group that assigned Klingon characters to a range of private-use code points. I'm aware. Ideally we could get 128 codepoints assigned for this. Yes, you'd have to know to decode from UTF-8 using those to binary, but their presence would be sufficient to indicate that a string is in fact a binary blob.

@Maxdamantus (Author)

Yes, you'd have to know to decode from UTF-8 using those to binary, but their presence would be sufficient to indicate that a string is in fact a binary blob.

You would need an external flag in this case to denote that it's a binary blob, since it's possible for binary data to already contain bytes that look like an encoding of these hypothetical code points.

For example, imagine that U+E1XX is reserved for this purpose, where U+E180 would denote the binary byte <80>. If some binary input happens to contain <EE 86 80> (the UTF-8 encoding of U+E180), that will presumably get decoded as <80> on output, corrupting the binary data.

If an external flag is used, there isn't much point in reencoding the binary data, since it could be stored in plain binary form. The downside is that operations have to account for the different string representations (and concatenation between different representations would fail in some cases).

@nicowilliams (Contributor)

Yes, you'd have to know to decode from UTF-8 using those to binary, but their presence would be sufficient to indicate that a string is in fact a binary blob.

You would need an external flag in this case to denote that it's a binary blob, since it's possible for binary data to already contain bytes that look like an encoding of these hypothetical code points.

You'd have to know, yes, and the point is that if jq produces it (unless it's a mode where it produces actual binary) then it's UTF-8. The "external flag" here is implied as long as you don't leak this "UTF-8" into contexts where binary is expected without first decoding.

We have this problem in spades. Filesystem related system calls in Unix are all 8-bit clean except for ASCII / and ASCII NUL, and so if you list a directory you've no idea what codeset each name is in unless you enforce a convention.

Anyways, I think for jq all of this is out of scope. WTF-8 might be in scope, but WTF-8b... I think would require socializing more.

@Maxdamantus (Author)

I've made some updates to these changes:

  • f_json_parse reuses the parse function in jv_parse.c again, which has been refactored to work on both types of strings.
  • Various functions that I had previously called *_extended_* are now called *_wtf_*, since the intention is to denote that they work with the extended (internal) string representation, which happens to be WTF-8/WTF-8b. The previous names were more abstract (not referring to implementation details of the representation), but I think the new names are clearer.
  • A jvp_utf8_wtf_next_bytes function has been added which makes iteration through UTF-8 bytes simpler, where for well-formed strings, the entire content will be emitted in one chunk.
  • explode and implode preserve errors by using negative code points. I was thinking of having separate functions that do this, but given that until these changes, binary data hasn't really worked properly, it should be compatible with current usage of jq.
  • @uri is implemented properly, so the internal representation is no longer exposed for ill-formed strings.
  • The jvp_utf8_backtrack function is removed, since this was a workaround for the fact that strings couldn't hold partial UTF-8 data, but now they can.

@Maxdamantus mentioned this pull request Jul 26, 2023