Cirq-ionq uses unparseable ASCII character in encoding #5216
Description
Filing this as both a slight annoyance we plan on tackling and a 'buganizer Wonderbug' for @dabacon, since I can't bug him on buganizer for real anymore ;)
Cirq uses unparseable ASCII characters in its serialization format for IonQ. This is not actively causing issues, but it has broken us in the past. We would like to stop using them, and this bug tells the story of why.
- How should it (even) work?
It's explained in a comment here: https://github.com/quantumlib/Cirq/blob/master/cirq-ionq/cirq_ionq/serializer.py#L241-L246
Each key and targets are serialized into a string of the form `key` + the ASCII unit
separator (chr(31)) + targets as a comma separated value. These are then combined
into a string with a separator character of the ASCII record separator (chr(30)).
Finally this full string is serialized as the values in the metadata dict with keys
given by `measurementX` for X = 0,1, .. 9 and X large enough to contain the entire
string.
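The scheme described in that comment can be sketched in Python. This is only an illustration of the described format, not the actual cirq-ionq serializer; the chunk size and the `encode_measurements` helper are assumptions for the sketch:

```python
# Sketch of the measurement-key encoding described in the comment.
# NOT the real serializer; chunk_size and the helper name are made up.
US = chr(31)  # ASCII Unit Separator, 0x1F
RS = chr(30)  # ASCII Record Separator, 0x1E

def encode_measurements(keys_to_targets):
    """Encode {measurement key: [target qubits]} into a metadata dict."""
    records = [
        key + US + ','.join(str(t) for t in targets)
        for key, targets in keys_to_targets.items()
    ]
    full = RS.join(records)
    # Split into chunks small enough for a metadata value, stored under
    # keys measurement0, measurement1, ... (chunk size assumed here).
    chunk_size = 400
    return {
        f'measurement{i}': full[i * chunk_size:(i + 1) * chunk_size]
        for i in range((len(full) + chunk_size - 1) // chunk_size)
    }

meta = encode_measurements({'x': [0], 'y': [1]})
print(meta)  # {'measurement0': 'x\x1f0\x1ey\x1f1'}
```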
Okay, cool, we're using some ASCII control characters (Unit Separator, ASCII code 31/0x1F, and Record Separator, code 30/0x1E). These characters never occur in normal text, yet they're part of ASCII, so they should work even when serialized into a non-unicode-safe storage layer. What could go wrong?
- Why it doesn't work
"Modern" languages really don't like those characters in JSON. Here's an example, using vanilla python 3.9
import json
z = '{ "foo": "b' + chr(31) + 'ar" }'
json.loads(z)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.9/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 12 (char 11)
You get similar results in other languages (in JavaScript, for example, JSON.parse also rejects unescaped control characters).
That's a bummer, the trick to use valid ASCII breaks parsing of JSON, and since we want to embed these strings in JSON, that's a problem.
But wait, it works like this right now! Which brings us to
- Why it works anyway
(See cirq-ionq/cirq_ionq/job.py, lines 167 to 170 at fe7fe4e.)
Cirq requires these characters when it splits the metadata on a returned job to get the measurement keys out.
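That split looks roughly like this. This is a sketch of the decode side under the format described earlier, not the actual job.py code; the `decode_measurements` helper is hypothetical:

```python
# Sketch of the decode side: reassemble the measurementX chunks and
# split on the control characters. NOT the actual job.py code.
US = chr(31)  # Unit Separator
RS = chr(30)  # Record Separator

def decode_measurements(metadata):
    """Recover {measurement key: [target qubits]} from a metadata dict."""
    full = ''.join(
        metadata[f'measurement{i}']
        for i in range(10)
        if f'measurement{i}' in metadata
    )
    result = {}
    for record in full.split(RS):
        key, targets = record.split(US)
        result[key] = [int(t) for t in targets.split(',')]
    return result

print(decode_measurements({'measurement0': 'x\x1f0\x1ey\x1f1'}))
# {'x': [0], 'y': [1]}
```

Without a literal chr(30)/chr(31) in the decoded string, those splits produce garbage, which is why the characters are load-bearing on the client.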
If this is true, how does it work? The character is needed here, but /can't/ be returned by the API, or browsers/clients everywhere will choke on the data!
The answer is implicit unicode conversion!
If you dump the response you get from api.ionq.co you'll see
0060: 22 2c 22 6d 65 74 61 64 61 74 61 22 3a 7b 22 6d ","metadata":{"m
0070: 65 61 73 75 72 65 6d 65 6e 74 30 22 3a 22 78 5c easurement0":"x\
0080: 75 30 30 31 66 30 5c 75 30 30 31 65 79 5c 75 30 u001f0\u001ey\u0
0090: 30 31 66 31 22 2c 22 73 68 6f 74 73 22 3a 22 31 01f1","shots":"1
00a0: 30 30 30 22 7d 2c 22 70 72 65 64 69 63 74 65 64 000"},"predicted
Ah, so it's using escape sequences \u001f and \u001e, not chr(31) at all!
When python serializes to json, it handles this properly:
x = {'foo': 'bar\x1f'}
json.dumps(x)
'{"foo": "bar\\u001f"}'
And that's how we see it at the API itself.
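A quick round trip makes this concrete: the control character never appears as a raw byte on the wire, only as a `\u` escape, and it is the JSON parser on the receiving end that turns it back into a real chr(31):

```python
import json

# Same shape as the metadata seen in the hex dump above.
payload = {'measurement0': 'x' + chr(31) + '0' + chr(30) + 'y' + chr(31) + '1'}

wire = json.dumps(payload)
print(wire)  # {"measurement0": "x\u001f0\u001ey\u001f1"}
assert '\\u001f' in wire and chr(31) not in wire  # escaped on the wire, never raw

decoded = json.loads(wire)
assert decoded == payload  # the parser restores the real control characters
```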
However, we've now lost the original motivation for using these characters! What actually travels over the wire isn't ASCII at all; it's a JSON unicode escape sequence. And if a change were made on either side to support only ASCII, it might even seem reasonable to emit the literal bytes '\x1f', and, well, bad things result.
So this is the story of how we ended up trying to use ASCII, fell backward into unicode without realizing, and are now in the uncomfortable position of requiring the client and server to be unicode-aware.