You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
when trying to load one of my delta tables. After doing some investigation, I discovered some really weird behavior around how delta-rs encodes binary in the commit file json, specifically for the min/maxValues.
Essentially, the way I understand how a binary gets encoded as a string is as follows: any byte whose codepoint is < 128 gets encoded as-is, and any byte whose codepoint > 128 gets encoded as its unicode codepoint.
e.g.
'hello world \xdd\x1c'
becomes
\"hello world \\u00dd\\u001c\"
However, when there happens to be a sequence of bytes in the binary that falls in the UTF-16 surrogate range, when delta-rs tries to decode the string, it sees the surrogate and tries to decode it as UTF-16, but since it's not actually valid UTF-16 surrogate, it fails.
For example:
from deltalake import DeltaTable, write_deltalake
import pandas as pd
import pyarrow as pa
bad_binary = b'hello \\uDd1c world \xdd\x1c'
df = pd.DataFrame.from_dict(
{
"p": [1],
"k": [1],
"v": [bad_binary],
}
)
write_deltalake(
"test-bad-binary",
pa.table(df, schema=pa.schema([("p", pa.int64()), ("k", pa.int64()), ("v", pa.binary())])),
partition_by="p",
mode="append",
)
The write goes through, but now the DeltaTable is corrupt, and you cannot load the DeltaTable anymore. Inspecting the commit, we see
\"hello \\udd1c world \\u00dd\\u001c\"
and since \udd1c falls in the surrogate range, we see a json parse error about that.
Furthermore, the capitalized letter gets written as lowercase, which is just incorrect, but probably unrelated.
What you expected to happen:
When I write the same data using spark, I see the commit file generated by spark does not write any min/max values for binary columns. I expect this to be the correct resolution for this bug.
The text was updated successfully, but these errors were encountered:
Environment
Delta-rs version: 0.18.1
Binding: python
Bug
What happened:
I encountered a bizarre exception
when trying to load one of my delta tables. After doing some investigation, I discovered some really weird behavior around how delta-rs encodes binary in the commit file json, specifically for the min/maxValues.
Essentially, the way I understand how a binary gets encoded as a string is as follows: any byte whose codepoint is < 128 gets encoded as-is, and any byte whose codepoint > 128 gets encoded as its unicode codepoint.
e.g.
becomes
However, when there happens to be a sequence of bytes in the binary that falls in the UTF-16 surrogate range, when delta-rs tries to decode the string, it sees the surrogate and tries to decode it as UTF-16, but since it's not actually valid UTF-16 surrogate, it fails.
For example:
The write goes through, but now the DeltaTable is corrupt, and you cannot load the DeltaTable anymore. Inspecting the commit, we see
and since
\udd1c
falls in the surrogate range, we see a json parse error about that.Furthermore, the capitalized letter gets written as lowercase, which is just incorrect, but probably unrelated.
What you expected to happen:
When I write the same data using spark, I see the commit file generated by spark does not write any min/max values for binary columns. I expect this to be the correct resolution for this bug.
The text was updated successfully, but these errors were encountered: