@fivetran-felixhuang
Collaborator

@fivetran-felixhuang fivetran-felixhuang commented Nov 14, 2025

At the moment, the string in a ByteString is transpiled to a string with the escape syntax e'...'. However, DuckDB has limited support for e'...'.

We need to handle the escape sequences in the ByteString input correctly, while also making sure the resulting DuckDB query produces the same result as the original BigQuery query.

To handle escape sequences, we can use the ::blob cast, and to handle other UTF-8 characters, we can use the encode() function in DuckDB. Neither works for the whole string, so ::blob and encode() have to be applied to different input segments.

For one, ::blob doesn't handle UTF-8 characters beyond the first 256 (such as 数).

Also, while encode() accepts escape sequences, it treats them as literal text instead of the bytes they denote, so the resulting query can produce different values. For example, MD5(b"Mixed\x00Texÿt") in BQ and base64(UNHEX(MD5(ENCODE('Mixed\x00Texÿt')))) in DuckDB produce different outputs.
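To make the mismatch concrete, here is a small Python sketch (not part of the PR) that hashes the two byte sequences involved. It assumes BigQuery stores the non-escaped characters of a bytes literal as UTF-8 and that the escape `\x00` denotes a single NUL byte. The variable names are illustrative only.

```python
import hashlib

# Bytes denoted by b"Mixed\x00Texÿt": the escape is ONE NUL byte,
# and "ÿ" is assumed to be stored as its two-byte UTF-8 encoding.
actual_bytes = b"Mixed\x00Tex" + "ÿt".encode("utf-8")

# Bytes obtained when the escape is kept as the four literal
# characters "\", "x", "0", "0" and the whole text is UTF-8 encoded.
literal_bytes = (r"Mixed\x00Tex" + "ÿt").encode("utf-8")

print(len(actual_bytes))   # 12
print(len(literal_bytes))  # 15

# Different inputs, so the MD5 digests differ as well.
digest_actual = hashlib.md5(actual_bytes).digest()
digest_literal = hashlib.md5(literal_bytes).digest()
print(digest_actual != digest_literal)  # True
```

The three-byte difference is exactly the escape sequence being read as text rather than as one byte, which is why the two queries disagree.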

The strategy here is to split the input into escape-sequence segments and plain-text segments, convert each kind with the appropriate mechanism (::blob or encode()), and concatenate the results as the output.

Examples of BQ to DuckDB

MD5(b"Mixed\x00\x00Texÿt") -> UNHEX(MD5(ENCODE('Mixed') || '\x00\x00'::BLOB || ENCODE('Texÿt')))
Here "Mixed\x00\x00Texÿt" is broken into ['Mixed', '\x00\x00', 'Texÿt'].

MD5(b'\x00ÿ\x00') -> UNHEX(MD5('\x00'::BLOB || ENCODE('ÿ') || '\x00'::BLOB))
MD5(b'ÿ数据') -> UNHEX(MD5(ENCODE('ÿ数据')))
MD5(B"Hello World") -> UNHEX(MD5(ENCODE('Hello World')))
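The splitting shown in these examples could be sketched roughly as follows. This is a standalone illustration, not the code in this PR and not sqlglot's API: `split_segments` and `to_duckdb_bytes_expr` are hypothetical helpers, and the sketch assumes only `\xNN` hex escapes need the ::BLOB path (other escapes such as `\n`, octal forms, or an escaped backslash before `x` are not handled).

```python
import re

# A run of one or more \xNN hex escapes. The capturing group makes
# re.split keep the matched runs in its output.
HEX_ESCAPE_RUN = re.compile(r"((?:\\x[0-9A-Fa-f]{2})+)")


def split_segments(raw: str) -> list[str]:
    """Split the literal's text into escape runs and plain-text runs."""
    return [part for part in HEX_ESCAPE_RUN.split(raw) if part]


def to_duckdb_bytes_expr(raw: str) -> str:
    """Build a DuckDB expression that concatenates the converted segments."""
    pieces = []
    for segment in split_segments(raw):
        # Double single quotes so the segment is a valid SQL string literal.
        quoted = "'" + segment.replace("'", "''") + "'"
        if HEX_ESCAPE_RUN.fullmatch(segment):
            pieces.append(f"{quoted}::BLOB")    # escape run: cast to BLOB
        else:
            pieces.append(f"ENCODE({quoted})")  # plain text: UTF-8 encode
    return " || ".join(pieces)


print(split_segments(r"Mixed\x00\x00Texÿt"))
print(to_duckdb_bytes_expr(r"\x00ÿ\x00"))
```

Adjacent escapes are grouped into one run so that `\x00\x00` yields a single ::BLOB cast, matching the first example above. An empty input produces an empty string here, which a real implementation would need to treat separately.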

@georgesittas
Collaborator

@fivetran-felixhuang let's discuss this on Monday. This approach feels too complicated.

3 participants