Parquet row group filter pushdown not working as expected #15356

fpixl · 2024-03-28T12:20:03Z

Checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.

Reproducible example

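import duckdb
import polars as pl

# DuckDB: filter on the column the row groups are organized by.
con = duckdb.connect(database=":memory:")
con.sql(
    """
    SELECT * FROM read_parquet('1BilRowData.parquet')
    WHERE attribute = 'gender';
    """
)

# Polars: the same filter through the lazy Parquet scanner.
df = pl.scan_parquet("1BilRowData.parquet")
df = df.filter(pl.col("attribute") == "gender").collect()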

Log output

No response

Issue description

I have a large Parquet file (2 billion rows) in entity-attribute-value format, where data expands as rows instead of columns. We do this because an entity can have hundreds of thousands of attributes.
I.e., instead of being formatted as:

row_id	name	gender	age
1	Alice	f	21
2	Bob	m	22
3	Joe	m	23

It's formatted as:

row_id	attribute	value
1	name	Alice
2	name	Bob
3	name	Joe
1	gender	f
2	gender	m
3	gender	m
1	age	21
2	age	22
3	age	23

The data is produced as a single Parquet file with row groups organized by attribute. Our read patterns are always by attribute and we never scan the whole file, so the expectation is that Polars can filter the data by row group and return the matching values without reading the whole file into memory (similar to what DuckDB does well).
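To illustrate what "filter the data by row group" means here, below is a minimal pure-Python sketch of min/max statistics pruning on the example table above. It is only a conceptual model, not the actual Polars or DuckDB implementation; `RowGroup` and `scan_eq` are hypothetical names made up for this illustration, and real Parquet readers take the statistics from the file footer.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical stand-in for a Parquet row group and its column statistics."""
    attr_min: str  # smallest `attribute` value stored in this group
    attr_max: str  # largest `attribute` value stored in this group
    rows: list     # (row_id, attribute, value) tuples


def scan_eq(row_groups, wanted):
    """Return rows where attribute == wanted, plus how many groups were read."""
    out, groups_read = [], 0
    for rg in row_groups:
        # Skip the group when its statistics prove it cannot contain a match.
        if wanted < rg.attr_min or wanted > rg.attr_max:
            continue
        groups_read += 1
        out.extend(r for r in rg.rows if r[1] == wanted)
    return out, groups_read


# One row group per attribute, mirroring the layout described above.
row_groups = [
    RowGroup("name", "name", [(1, "name", "Alice"), (2, "name", "Bob"), (3, "name", "Joe")]),
    RowGroup("gender", "gender", [(1, "gender", "f"), (2, "gender", "m"), (3, "gender", "m")]),
    RowGroup("age", "age", [(1, "age", "21"), (2, "age", "22"), (3, "age", "23")]),
]

rows, groups_read = scan_eq(row_groups, "gender")
print(rows, groups_read)
```

With one attribute per row group, the min and max of `attribute` are equal in every group, so an equality filter only has to read the groups for the requested attribute and can skip the rest.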

Expected behavior

When filtering a parquet file by a column used as row group, it uses row group push down to efficiently read the data.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             macOS-13.6-x86_64-i386-64bit
Python:               3.11.5 (main, Sep 22 2023, 10:10:52) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             2.6.4
pyiceberg:            0.6.0
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

ritchie46 · 2024-03-28T13:52:45Z

> When filtering a parquet file by a column used as row group, it uses row group push down to efficiently read the data.

It does. Can you share an MRE that shows/confirms a bug?

ritchie46 · 2024-03-30T13:15:27Z

We cannot do anything with the bug report as it is now.

fpixl added the labels bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), and python (Related to Python Polars) on Mar 28, 2024

ritchie46 added the label invalid (A bug report that is not actually a bug) and removed the label needs triage (Awaiting prioritization by a maintainer) on Mar 30, 2024
