I have a large Parquet file (2 billion rows) in entity-attribute-value format, where data expands as rows instead of columns. We use this layout because a single entity can have hundreds of thousands of attributes.
I.e., instead of being formatted as:

| row_id | name  | gender | age |
|--------|-------|--------|-----|
| 1      | Alice | f      | 21  |
| 2      | Bob   | m      | 22  |
| 3      | Joe   | m      | 23  |
It's formatted as:

| row_id | attribute | value |
|--------|-----------|-------|
| 1      | name      | Alice |
| 2      | name      | Bob   |
| 3      | name      | Joe   |
| 1      | gender    | f     |
| 2      | gender    | m     |
| 3      | gender    | m     |
| 1      | age       | 21    |
| 2      | age       | 22    |
| 3      | age       | 23    |
The data is produced as a single Parquet file with row groups keyed by attribute. The read pattern is always by attribute, and we never scan the whole file, so the expectation is that Polars can prune row groups when filtering on that column and return the values without reading the whole file into memory (similar to what DuckDB does well).
Expected behavior
When filtering a Parquet file on the column its row groups are keyed by, Polars should use row-group pushdown (pruning via the row-group min/max statistics) to read the data efficiently.
Checks
Reproducible example
Log output
No response
Installed versions