### Description
Hello,
internally, we wrote our own library that wraps `arrow-rs` to make it usable from Python. Something similar is publicly available through `arro3`, which I used here for a minimal reproducible example:
```python
import pyarrow.parquet
import arro3.io

data = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]
schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
data = [{"html": b"0" * d} for d in data]
t = pyarrow.Table.from_pylist(data, schema=schema)

path = "/tmp/foo.parquet"
with open(path, "wb") as file:
    for b in t.to_batches():
        # 6 rows with max_row_group_size=3 should end up as two row groups
        arro3.io.write_parquet(b, file, max_row_group_size=len(data) - 3)

reader = pyarrow.parquet.ParquetFile(path)
for i in range(2):
    print(len(reader.read_row_group(i)))
```
This code writes a bit of dummy binary data through `arrow-rs`. Reading it back with `pyarrow` results in:
File "pyarrow/_parquet.pyx", line 1655, in pyarrow._parquet.ParquetReader.read_row_group
File "pyarrow/_parquet.pyx", line 1691, in pyarrow._parquet.ParquetReader.read_row_groups
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.
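The per-row-group sizes mentioned in the observations below can be checked through the parquet footer metadata, which pyarrow still reads fine; only `read_row_group` raises. A minimal sketch using pyarrow's standard metadata API:

```python
import pyarrow.parquet

# Opening the file and parsing the footer succeeds; only read_row_group fails.
md = pyarrow.parquet.ParquetFile("/tmp/foo.parquet").metadata
for i in range(md.num_row_groups):
    rg = md.row_group(i)
    # total_byte_size is the uncompressed size of all pages in the row group
    print(i, rg.num_rows, rg.total_byte_size)
```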
### Observations
- Reading the same file back through `arro3` or our own internal library wrapping `arrow-rs` works just fine (see the sketch after this list)
- Reading the same file through `duckdb` also works just fine
- Reducing the amount of binary data per row slightly makes the error disappear (8_388_855 bytes per row, or 25_166_565 bytes per row group, or more seems to be the problematic amount)
- The issue is reproducible with `pyarrow` versions `18.1.0`, `19.0.1` and `20.0.0`
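For completeness, a minimal sketch of the read paths that do work (assuming `arro3.io.read_parquet` returns a record batch stream that can be iterated; the duckdb call uses its built-in `read_parquet` table function):

```python
import arro3.io
import duckdb

# Reading through arro3 (arrow-rs under the hood) works just fine;
# iterating the returned stream batch by batch is an assumption about the arro3 API.
total = 0
for batch in arro3.io.read_parquet("/tmp/foo.parquet"):
    total += batch.num_rows
print(total)  # 6 rows

# Reading through duckdb also works just fine.
print(duckdb.sql("SELECT count(*) FROM read_parquet('/tmp/foo.parquet')").fetchall())
```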