.sink_parquet() sometimes panics when statistics has "null_count": False #17306

Open
2 tasks done
etiennebacher opened this issue Jun 30, 2024 · 1 comment
Labels
A-io-parquet (Area: reading/writing Parquet files), A-panic (Area: code that results in panic exceptions), bug (Something isn't working), P-low (Priority: low), python (Related to Python Polars)

Comments

@etiennebacher

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import os
os.environ["POLARS_VERBOSE"] = "1"
import polars as pl

test = pl.LazyFrame({"a": [1]})

test.sink_parquet("foo.parquet", statistics={"null_count": False, "min": True, "max": False, "distinct_count": True})

Log output

RUN STREAMING PIPELINE
[df -> parquet_sink]
thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/output/parquet.rs:47:33:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("parquet: File out of specification: null count of a page is required"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sinks/output/parquet.rs:127:14:
called `Result::unwrap()` on an `Err` value: Any { .. }
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/etienne/.local/lib/python3.10/site-packages/polars/_utils/unstable.py", line 58, in wrapper
    return function(*args, **kwargs)
  File "/home/etienne/.local/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2233, in sink_parquet
    return lf.sink_parquet(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Any { .. }

Issue description

sink_parquet() panics when the statistics argument sets null_count to False, but only when other statistics keys are passed as well. For example, this works:

test.sink_parquet("foo.parquet", statistics={"null_count": False})

but this panics:

test.sink_parquet("foo.parquet", statistics={"null_count": False, "min": True, "max": False, "distinct_count": True})
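
Since only two combinations are compared above, it is not clear which flags actually trigger the panic. A minimal sketch for narrowing this down enumerates all 16 True/False assignments of the four keys used in this report and records the outcome of each. The names `statistics_combinations` and `probe` are hypothetical helpers, not part of Polars, and `write` is any callable supplied by the caller. Catching `BaseException` assumes that the `PanicException` shown in the log may not derive from `Exception`.

```python
from itertools import product

# The four statistics keys that appear in this report.
STAT_KEYS = ("null_count", "min", "max", "distinct_count")


def statistics_combinations():
    """Yield every True/False assignment of the four keys (16 dicts)."""
    for flags in product((False, True), repeat=len(STAT_KEYS)):
        yield dict(zip(STAT_KEYS, flags))


def probe(write):
    """Call write(statistics) for each combination and record the outcome.

    `write` is a caller-supplied callable (hypothetical), for example
    lambda s: lf.sink_parquet("foo.parquet", statistics=s).
    Returns a list of (statistics, outcome) pairs, where outcome is "ok"
    or the class name of whatever was raised.
    """
    results = []
    for stats in statistics_combinations():
        try:
            write(stats)
            outcome = "ok"
        except (KeyboardInterrupt, SystemExit):
            raise
        except BaseException as exc:
            # Assumption: the panic seen in the log may not subclass Exception,
            # so a plain `except Exception` could miss it.
            outcome = type(exc).__name__
        results.append((stats, outcome))
    return results
```

With Polars installed, something like `probe(lambda s: test.sink_parquet("foo.parquet", statistics=s))` would list which combinations fail. Whether the process stays usable for later calls after a panic has not been verified here.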

Expected behavior

sink_parquet() should either succeed or raise a proper error instead of panicking.

Installed versions

--------Version info---------
Polars:               1.0.0-rc.2
Index type:           UInt32
Platform:             Linux-6.5.0-41-generic-x86_64-with-glibc2.35
Python:               3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                1.21.5
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
sqlalchemy:           <not installed>
torch:                <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
etiennebacher added the bug, needs triage, and python labels on Jun 30, 2024

etiennebacher commented Jun 30, 2024

With the same statistics argument, write_parquet() raises a proper error instead of panicking:

import os
os.environ["POLARS_VERBOSE"] = "1"
import polars as pl

test = pl.DataFrame({"a": [1]})

test.write_parquet("foo.parquet", statistics={"null_count": False, "min": True, "max": False, "distinct_count": True})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/etienne/.local/lib/python3.10/site-packages/polars/dataframe/frame.py", line 3554, in write_parquet
    self._df.write_parquet(
polars.exceptions.ComputeError: parquet: File out of specification: null count of a page is required
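
Until the sink is fixed, one possible stopgap is to go through the eager path, which raises an ordinary exception as shown above. The sketch below is duck-typed: it only assumes that `lf` has a `collect()` method whose result has `write_parquet(path, statistics=...)`, as `LazyFrame` and `DataFrame` do in the examples in this issue. The helper name `write_via_eager_path` is hypothetical.

```python
def write_via_eager_path(lf, path, statistics):
    """Collect `lf` and write it with write_parquet() instead of sink_parquet().

    Returns None on success, or a "<ErrorClass>: <message>" string if the
    eager writer raised a regular exception. This gives up streaming: the
    whole frame is materialized in memory first.
    """
    try:
        lf.collect().write_parquet(path, statistics=statistics)
    except Exception as exc:
        # The eager path in this report raised ComputeError, a normal exception.
        return f"{type(exc).__name__}: {exc}"
    return None
```

For the reproducer in this issue, `write_via_eager_path(test.lazy(), "foo.parquet", {...})` would be expected to return the ComputeError message rather than crash, assuming the eager behaviour reported above.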

coastalwhite added the P-low, A-io-parquet, and A-panic labels and removed the needs triage label on Jul 9, 2024