
read_parquet([files], ..., use_pyarrow=False) fails when reading multiple files with different schema #13438

Closed
mcrumiller opened this issue Jan 4, 2024 · 1 comment · Fixed by #17321
Labels: A-io-parquet (reading/writing Parquet files) · accepted (ready for implementation) · bug · needs triage · python (related to Python Polars)

mcrumiller commented Jan 4, 2024

Checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
from pathlib import Path

pl.DataFrame({
    "a": [1, 2, 3],
    "b": [1, 2, 3],
}).write_parquet("df1.pqt")

pl.DataFrame({
    "a": [4, 5, 6],
    "c": [4, 5, 6],
}).write_parquet("df2.pqt")

pl.read_parquet(["df1.pqt", "df2.pqt"], use_pyarrow=True)   # succeeds
pl.read_parquet(["df1.pqt", "df2.pqt"], use_pyarrow=False)  # fails

Log output

thread '<unnamed>' panicked at polars/crates/polars-parquet/src/arrow/read/deserialize/mod.rs:141:31:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "polars/py-polars/scripts/check_pq.py", line 14, in <module>
    pl.read_parquet(["df1.pqt", "df2.pqt"], columns=["a"], use_pyarrow=False)
  File "polars/py-polars/polars/io/parquet/functions.py", line 183, in read_parquet
    return lf.collect(no_optimization=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mcrumiller/projects/polars/py-polars/polars/lazyframe/frame.py", line 1749, in collect
    return wrap_df(ldf.collect())
                   ^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

Issue description

A more general case of #13436: `read_parquet` fails when reading multiple files with different schemas and `use_pyarrow=False` is passed.

Expected behavior

Should return a df with the single column `a`, since this is the only column common to both parquet files.

Installed versions

--------Version info---------
Polars:               0.20.3
Index type:           UInt32
Platform:             Linux-5.15.133.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.11.3 (main, Apr 15 2023, 14:44:51) [GCC 11.3.0]

----Optional dependencies----
adbc_driver_manager:  0.8.0
cloudpickle:          3.0.0
connectorx:           0.3.2
deltalake:            0.15.0
fsspec:               2023.12.2
gevent:               23.9.1
hvplot:               0.9.1
matplotlib:           3.8.2
numpy:                1.26.3
openpyxl:             3.1.2
pandas:               2.1.4
pyarrow:              14.0.2
pydantic:             2.5.3
pyiceberg:            0.5.1
pyxlsb:               1.0.10
sqlalchemy:           2.0.25
xlsx2csv:             0.8.1
xlsxwriter:           3.1.9
ion-elgreco commented:
I proposed something related to this: #13086

@stinodego stinodego added the needs triage Awaiting prioritization by a maintainer label Jan 13, 2024
@stinodego stinodego added the A-io-parquet Area: reading/writing Parquet files label Jan 21, 2024
@c-peters c-peters added the accepted Ready for implementation label Jul 8, 2024