Files containing binary data with >=8_388_855 bytes per row written with arrow-rs can't be read with pyarrow #7489

Closed
@jonded94

Description

Hello,

internally, we wrote our own library that wraps arrow-rs to make it usable from Python.
Something similar is publicly available as arro3, which I used here for a minimal reproducible example:

import pyarrow
import pyarrow.parquet
import arro3.io

# Payload size in bytes for each row
sizes = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]

schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
data = [{"html": b"0" * size} for size in sizes]

t = pyarrow.Table.from_pylist(data, schema=schema)

# Write through arrow-rs (via arro3), three rows per row group
path = "/tmp/foo.parquet"
with open(path, "wb") as file:
    for b in t.to_batches():
        arro3.io.write_parquet(b, file, max_row_group_size=len(data) - 3)

# Read both row groups back with pyarrow
reader = pyarrow.parquet.ParquetFile(path)
for i in range(2):
    print(len(reader.read_row_group(i)))

This code writes some dummy binary data through arrow-rs. Reading the resulting file with pyarrow fails with:

  File "pyarrow/_parquet.pyx", line 1655, in pyarrow._parquet.ParquetReader.read_row_group
  File "pyarrow/_parquet.pyx", line 1691, in pyarrow._parquet.ParquetReader.read_row_groups
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.

Observations

  • Reading the same file through arro3 or our own internal library wrapping arrow-rs works just fine
  • Reading the same file through duckdb also works just fine
  • Slightly reducing the amount of binary data per row makes the error disappear (the problematic threshold seems to be 8_388_855 bytes or more per row, or 25_166_565 bytes or more per row group)
  • The issue is reproducible with pyarrow versions 18.1.0, 19.0.1 and 20.0.0
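
For reference, the numbers in the observations can be checked with a few lines of arithmetic. This is only a sketch based on the sizes in the reproducer: it shows that 25_166_565 is three rows of 8_388_855 bytes and that the per-row threshold sits 247 bytes above 8 MiB. Whether the proximity to 8 MiB is meaningful is not established in this report; the variable names are illustrative.

```python
MIB = 1024 * 1024

sizes = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]
rows_per_group = len(sizes) - 3  # same value as max_row_group_size in the reproducer

# Reported per-row threshold and its distance from 8 MiB
threshold_per_row = 8_388_855
offset_from_8_mib = threshold_per_row - 8 * MIB

# Reported per-row-group threshold: three rows at the per-row threshold
threshold_per_group = rows_per_group * threshold_per_row

# Actual payload bytes per row group written by the reproducer
group_totals = [
    sum(sizes[i:i + rows_per_group])
    for i in range(0, len(sizes), rows_per_group)
]

print(offset_from_8_mib)    # 247
print(threshold_per_group)  # 25166565
print(group_totals)         # [25166632, 25166635]
```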

Labels: bug, parquet (Changes to the parquet crate)
