Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can corrupt delta table commit file min/max values with binary column #2629

Open
echai58 opened this issue Jun 27, 2024 · 2 comments
Open

Can corrupt delta table commit file min/max values with binary column #2629

echai58 opened this issue Jun 27, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@echai58
Copy link

echai58 commented Jun 27, 2024

Environment

Delta-rs version: 0.18.1

Binding: python


Bug

What happened:
I encountered a bizarre exception

Exception: Json error: Invalid UTF-16 surrogate pair 1406143

when trying to load one of my delta tables. After doing some investigation, I discovered some really weird behavior around how delta-rs encodes binary in the commit file json, specifically for the min/maxValues.

Essentially, the way I understand how a binary gets encoded as a string is as follows: any byte whose codepoint is < 128 gets encoded as-is, and any byte whose codepoint > 128 gets encoded as its unicode codepoint.
e.g.

'hello world \xdd\x1c'

becomes

\"hello world \\u00dd\\u001c\"

However, when there happens to be a sequence of bytes in the binary that falls in the UTF-16 surrogate range, when delta-rs tries to decode the string, it sees the surrogate and tries to decode it as UTF-16, but since it's not actually valid UTF-16 surrogate, it fails.

For example:

from deltalake import DeltaTable, write_deltalake
import pandas as pd
import pyarrow as pa

bad_binary = b'hello \\uDd1c world \xdd\x1c'

df = pd.DataFrame.from_dict(
    {
        "p": [1],
        "k": [1],
        "v": [bad_binary],
    }
)

write_deltalake(
    "test-bad-binary",
    pa.table(df, schema=pa.schema([("p", pa.int64()), ("k", pa.int64()), ("v", pa.binary())])),
    partition_by="p",
    mode="append",
)

The write goes through, but now the DeltaTable is corrupt, and you cannot load the DeltaTable anymore. Inspecting the commit, we see

\"hello \\udd1c world \\u00dd\\u001c\"

and since \udd1c falls in the surrogate range, we see a json parse error about that.

Furthermore, the capitalized letter gets written as lowercase, which is just incorrect, but probably unrelated.

What you expected to happen:
When I write the same data using spark, I see the commit file generated by spark does not write any min/max values for binary columns. I expect this to be the correct resolution for this bug.

@echai58 echai58 added the bug Something isn't working label Jun 27, 2024
@abhiaagarwal
Copy link
Contributor

Can you set engine="rust" and see if the error replicates?

@echai58
Copy link
Author

echai58 commented Jun 28, 2024

Can you set engine="rust" and see if the error replicates?

@abhiaagarwal The error does not replicate with engine="rust". The commit file has the following:

\"hello \\\\\\\\uDd1c world \\\\xdd\\\\x1c\"

which is properly escaping the \ and is mantaining the casing correctly as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants