Feat(duckdb): handle transpilation into DuckDB from ByteString type #6329
+36
−0
At the moment, the string in a ByteString is transpiled to a string with the escape syntax e'...'. However, DuckDB has limited support for e'...'.
We need to handle the escape sequences in the ByteString input correctly, while also making sure the resulting DuckDB query produces the same result as the original BigQuery query.
To handle escape sequences, we can use the ::blob cast, and to handle other possible UTF-8 characters, we can use the encode() function in DuckDB. Neither works for every input, so different segments of the input need different treatment:
For one, ::blob cannot handle UTF-8 characters beyond the first 256 (such as 数).
Also, while encode() accepts escape sequences, it treats them as literal text rather than as the bytes they denote, so the resulting query can produce different values. For example, MD5(b"Mixed\x00Texÿt") in BQ and base64(UNHEX(MD5(ENCODE('Mixed\x00Texÿt')))) in DuckDB produce different outputs.
The strategy here is to split the input into escape-sequence segments and all other segments, convert each kind with the matching method (::blob for escape sequences, encode() for the rest), and concatenate the results with || as the output.
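The splitting step can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper name `split_byte_string` is hypothetical, and it assumes that only `\xNN` hex escapes count as escape sequences (other escape forms such as octal or `\n` are not covered).

```python
import re

# Assumption: only runs of \xNN hex escapes are treated as raw-byte segments.
# The capturing group makes re.split keep the matched runs in its output.
_HEX_ESCAPE_RUN = re.compile(r"((?:\\x[0-9A-Fa-f]{2})+)")


def split_byte_string(text: str) -> list[str]:
    """Split text into alternating plain-text and hex-escape segments.

    `text` is the literal's content with backslashes kept as written,
    i.e. the four characters \\x00 rather than a NUL byte.
    """
    # re.split yields empty strings when the text starts or ends with an
    # escape run; those carry no content, so they are dropped.
    return [part for part in _HEX_ESCAPE_RUN.split(text) if part]
```

Adjacent escapes are grouped into one segment, so `split_byte_string(r"Mixed\x00\x00Texÿt")` yields three pieces rather than four.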
Examples of BQ to DuckDB:
MD5(b"Mixed\x00\x00Texÿt") -> UNHEX(MD5(ENCODE('Mixed') || '\x00\x00'::BLOB || ENCODE('Texÿt')))
Here "Mixed\x00\x00Texÿt" is broken into ['Mixed', '\x00\x00', 'Texÿt'], and consecutive escape sequences stay together in one segment.
MD5(b'\x00ÿ\x00') -> UNHEX(MD5('\x00'::BLOB || ENCODE('ÿ') || '\x00'::BLOB))
MD5(b'ÿ数据') -> UNHEX(MD5(ENCODE('ÿ数据')))
MD5(B"Hello World") -> UNHEX(MD5(ENCODE('Hello World')))
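To make the pattern in these examples concrete, here is a small sketch that turns an already-split list of segments into the DuckDB expression. It is not the generator code from this PR: `render_duckdb_bytes` is a hypothetical name, it builds plain SQL text instead of sqlglot expression nodes, and it assumes escape segments consist solely of `\xNN` sequences.

```python
import re

# Assumption: a segment is an escape segment only if it is made up
# entirely of \xNN hex escapes.
_HEX_ESCAPE_ONLY = re.compile(r"(?:\\x[0-9A-Fa-f]{2})+")


def render_duckdb_bytes(segments: list[str]) -> str:
    """Render segments as a DuckDB expression joined with ||."""
    rendered = []
    for segment in segments:
        # Double any single quotes so the SQL string literal stays valid.
        literal = "'" + segment.replace("'", "''") + "'"
        if _HEX_ESCAPE_ONLY.fullmatch(segment):
            # Escape sequences become raw bytes through the BLOB cast.
            rendered.append(f"{literal}::BLOB")
        else:
            # Everything else is UTF-8 encoded via ENCODE().
            rendered.append(f"ENCODE({literal})")
    return " || ".join(rendered)
```

For instance, `render_duckdb_bytes(["Mixed", r"\x00\x00", "Texÿt"])` returns the argument passed to MD5 in the first example above.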