### Description
Hello,
internally, we wrote our own library that wraps `arrow-rs` to make it usable from Python. Something similar is publicly available through `arro3`, which I used here for a minimal reproducible example:
```python
import pyarrow.parquet
import arro3.io

data = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]
schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
data = [{"html": b"0" * d} for d in data]
t = pyarrow.Table.from_pylist(data, schema=schema)

path = "/tmp/foo.parquet"
with open(path, "wb") as file:
    for b in t.to_batches():
        # 6 rows with max_row_group_size=3 should end up as two row groups
        arro3.io.write_parquet(b, file, max_row_group_size=len(data) - 3)

reader = pyarrow.parquet.ParquetFile(path)
for i in range(2):
    print(len(reader.read_row_group(i)))
```
This code writes a bit of dummy binary data through `arrow-rs`. Reading it back with `pyarrow` results in:
File "pyarrow/_parquet.pyx", line 1655, in pyarrow._parquet.ParquetReader.read_row_group
File "pyarrow/_parquet.pyx", line 1691, in pyarrow._parquet.ParquetReader.read_row_groups
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.
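The per-row-group sizes mentioned in the observations below can be checked through the parquet footer metadata, which pyarrow still reads fine; only `read_row_group` raises. A minimal sketch using pyarrow's standard metadata API:

```python
import pyarrow.parquet

# Opening the file and parsing the footer succeeds; only read_row_group fails.
md = pyarrow.parquet.ParquetFile("/tmp/foo.parquet").metadata
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # total_byte_size is the uncompressed size of all pages in the row group
    print(i, rg.num_rows, rg.total_byte_size)
```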
### Observations
- Reading the same file back through `arro3` or our own internal library wrapping `arrow-rs` works just fine (see the sketch after this list)
- Reading the same file through `duckdb` also works just fine
- Reducing the amount of binary data per row slightly makes the error disappear (8_388_855 bytes per row, or 25_166_565 bytes per row group, or more seems to be the problematic amount)
- The issue is reproducible with `pyarrow` versions `18.1.0`, `19.0.1` and `20.0.0`
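For completeness, a minimal sketch of the read paths that do work (assuming `arro3.io.read_parquet` returns a record batch stream that can be iterated; the duckdb call uses its built-in `read_parquet` table function):

```python
import arro3.io
import duckdb

# Reading through arro3 (arrow-rs under the hood) works just fine;
# iterating the returned stream batch by batch is an assumption about the arro3 API.
total = 0
for batch in arro3.io.read_parquet("/tmp/foo.parquet"):
    total += batch.num_rows
print(total)  # 6 rows

# Reading through duckdb also works just fine.
print(duckdb.sql("SELECT count(*) FROM read_parquet('/tmp/foo.parquet')").fetchall())
```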