Parquet row group filter pushdown not working as expected #15356

Open
2 tasks done
fpixl opened this issue Mar 28, 2024 · 2 comments
Labels
bug (Something isn't working), invalid (A bug report that is not actually a bug), python (Related to Python Polars)

Comments

fpixl commented Mar 28, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

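import duckdb
import polars as pl

# DuckDB: filter on the attribute column while reading the Parquet file.
con = duckdb.connect(database=":memory:")
con.sql(
    """
    SELECT * FROM read_parquet('1BilRowData.parquet')
    WHERE attribute = 'gender';
    """
)

# Polars: lazy scan with the same filter, then materialize the result.
df = pl.scan_parquet("1BilRowData.parquet")
df = df.filter(pl.col("attribute") == "gender").collect()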

Log output

No response

Issue description

I have a large Parquet file (2 billion rows) in entity-attribute-value format, where data expands as rows instead of columns. We do this because an entity can have hundreds of thousands of attributes.
That is, instead of being formatted as:

row_id  name   gender  age
1       Alice  f       21
2       Bob    m       22
3       Joe    m       23

It's formatted as:

row_id  attribute  value
1       name       Alice
2       name       Bob
3       name       Joe
1       gender     f
2       gender     m
3       gender     m
1       age        21
2       age        22
3       age        23
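
To make the two layouts concrete, here is a minimal pure-Python sketch of the reshaping. The `to_eav` helper and the `wide_rows` sample are hypothetical names used only for illustration, and casting every value to a string is an assumption about how the mixed-type `value` column is stored.

```python
# Hypothetical wide records, mirroring the first table above.
wide_rows = [
    {"row_id": 1, "name": "Alice", "gender": "f", "age": 21},
    {"row_id": 2, "name": "Bob", "gender": "m", "age": 22},
    {"row_id": 3, "name": "Joe", "gender": "m", "age": 23},
]


def to_eav(rows, id_key="row_id"):
    """Unpivot wide records into (row_id, attribute, value) tuples.

    Rows are emitted attribute by attribute, so all rows for one
    attribute are contiguous, as in the second table above.
    """
    attributes = [key for key in rows[0] if key != id_key]
    return [
        # Assumption: values are stored as strings in the value column.
        (row[id_key], attribute, str(row[attribute]))
        for attribute in attributes
        for row in rows
    ]


eav_rows = to_eav(wide_rows)
for record in eav_rows:
    print(record)
```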

The data is written as a single Parquet file whose row groups are organized by attribute. Reads are always by attribute and never scan the whole file, so the expectation is that Polars can filter by row group and return the matching values without reading the whole file into memory (as DuckDB does well).
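
For context, a Parquet file can store min/max statistics for each column in each row group, and a reader can skip any row group whose range cannot contain the filter value. The sketch below is only a conceptual model of that check, not the Polars or DuckDB implementation; `RowGroupStats` and `groups_to_read` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Min/max statistics of the attribute column for one row group."""

    index: int
    min_value: str
    max_value: str


def groups_to_read(stats, target):
    """Return indices of row groups whose [min, max] range could contain target."""
    return [s.index for s in stats if s.min_value <= target <= s.max_value]


# One row group per attribute: min == max, so the ranges never overlap.
per_attribute = [
    RowGroupStats(0, "name", "name"),
    RowGroupStats(1, "gender", "gender"),
    RowGroupStats(2, "age", "age"),
]

# Row groups that each mix all attributes: every range spans "age".."name".
mixed = [
    RowGroupStats(0, "age", "name"),
    RowGroupStats(1, "age", "name"),
    RowGroupStats(2, "age", "name"),
]

selected_clustered = groups_to_read(per_attribute, "gender")
selected_mixed = groups_to_read(mixed, "gender")
print(selected_clustered)  # only the "gender" row group qualifies
print(selected_mixed)  # no row group can be ruled out
```

The comparison shows why the layout matters: statistics only help when each row group covers a narrow range of attribute values.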

Expected behavior

When a Parquet file is filtered on the column that defines its row groups, Polars uses row group pushdown to read the data efficiently.

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             macOS-13.6-x86_64-i386-64bit
Python:               3.11.5 (main, Sep 22 2023, 10:10:52) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2023.12.2
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.2
pydantic:             2.6.4
pyiceberg:            0.6.0
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>

fpixl added the bug, needs triage, and python labels on Mar 28, 2024

ritchie46 (Member) commented

> When a Parquet file is filtered on the column that defines its row groups, Polars uses row group pushdown to read the data efficiently.

It does. Can you share an MRE that shows/confirms a bug?

ritchie46 added the invalid label and removed the needs triage label on Mar 30, 2024

ritchie46 (Member) commented

We cannot do anything with the bug report as it is now.
