Unable to collect slices or masks of Parquet with nested data spanning across more than one data page #18400

Closed
2 tasks done
coastalwhite opened this issue Aug 27, 2024 · 0 comments · Fixed by #18407
Labels
accepted Ready for implementation bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars

@coastalwhite
Collaborator

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import pyarrow.parquet as pq
import io

width = 4100
df = pl.DataFrame([
    pl.Series('a', [
        [i for i in range(width)],
        [i for i in range(width)],
    ], pl.Array(pl.Int64, width)),
])

f = io.BytesIO()
pq.write_table(
    df.to_arrow(),
    f,
    use_dictionary=False,
    data_page_size=1024,
    column_encoding={ 'a': 'PLAIN' },
)

f.seek(0)
print(pl.read_parquet(f, n_rows=1))

Log output

thread '<unnamed>' panicked at crates/polars-parquet/src/arrow/read/deserialize/mod.rs:78:22:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("values (of len 4096) must be a multiple of size (4100) in FixedSizeListArray."))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/johndoe/Projects/polars/repro.py", line 23, in <module>
    print(pl.read_parquet(f, n_rows=1))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johndoe/Projects/polars/py-polars/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johndoe/Projects/polars/py-polars/polars/_utils/deprecation.py", line 91, in wrapper
    return function(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/johndoe/Projects/polars/py-polars/polars/io/parquet/functions.py", line 170, in read_parquet
    return _read_parquet_binary(
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/johndoe/Projects/polars/py-polars/polars/io/parquet/functions.py", line 257, in _read_parquet_binary
    pydf = PyDataFrame.read_parquet(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("values (of len 4096) must be a multiple of size (4100) in FixedSizeListArray."))

Issue description

Our code currently assumes that nested structures span a single data page. This is not necessarily true.
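To illustrate the failure mode, here is a minimal pure-Python sketch, not the Polars decoder. It assumes a hypothetical page split of 4096 values per page (chosen to match the length in the panic message; the real page boundaries written by pyarrow may differ) and a hypothetical helper `rows_from_pages`. A reader that treats each page independently sees 4096 values, which is not a multiple of the row width 4100. A reader that carries the partial row over to the next page recovers both rows.

```python
from typing import Iterable, Iterator, List


def rows_from_pages(pages: Iterable[List[int]], width: int) -> Iterator[List[int]]:
    """Yield fixed-width rows, carrying partial rows across page boundaries.

    Hypothetical helper for illustration only; not part of Polars.
    """
    pending: List[int] = []
    for page in pages:
        pending.extend(page)
        # Emit every complete row that is currently buffered.
        full = len(pending) // width
        for r in range(full):
            yield pending[r * width:(r + 1) * width]
        # Keep the incomplete tail for the next page.
        del pending[:full * width]
    if pending:
        raise ValueError(f"{len(pending)} trailing values do not form a full row")


width = 4100
page_len = 4096  # assumed page size in values, for illustration

# Two rows of `width` values each, flattened as they would be stored.
values = list(range(width)) * 2
pages = [values[i:i + page_len] for i in range(0, len(values), page_len)]

# Per-page decoding fails: the first page alone is not a whole number of rows.
first_page_is_whole_rows = len(pages[0]) % width == 0

# Carrying state across pages recovers both rows.
rows = list(rows_from_pages(pages, width))
```

With this split, the first row ends four values into the second page, so any slice or mask that selects it has to read from more than one page.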

Expected behavior

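The read should succeed: `pl.read_parquet(f, n_rows=1)` should return the first row without panicking, even when that row's values span more than one data page.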

Installed versions

Replace this line with the output of pl.show_versions(). Leave the backticks in place.
@coastalwhite coastalwhite added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer labels Aug 27, 2024
coastalwhite added a commit to coastalwhite/polars that referenced this issue Aug 27, 2024
Fixes pola-rs#18400.

This completely rewrites the logic that deals with decoding filtered nested
values. This was needed to accommodate the fact that any kind of row value
might span several data pages. This was not really accounted for in the
previous implementation and where it worked, it was mostly by coincidence.

This also made it possible to add many fast paths for the decoder that were
previously not there. Therefore, I think there are also some performance
improvements here, but I am not sure how large.

I also wrote two tests to make sure that it actually works.
@c-peters c-peters added the accepted Ready for implementation label Sep 3, 2024