Expressions result in different serializations across Python Polars PATCH version upgrades #16821

Open
2 tasks done
hiideaki opened this issue Jun 8, 2024 · 1 comment
Labels
enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer python Related to Python Polars

hiideaki commented Jun 8, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

Below is the Python code used to serialize the expression to JSON in different Polars versions:

import polars as pl

print(pl.__version__)
handle_nulls = pl.any_horizontal(pl.all().is_null())
expr = (
    pl
        .when(handle_nulls)
        .then(pl.lit(-1))
        .otherwise(
            pl.lit(100)
            + pl.col("input_column_1").fill_null(1)
        )
).alias("output_column")

to_filename = "expr_" + str(pl.__version__).replace(".", "_") + ".json"
expr.meta.serialize(to_filename)

with open(to_filename, "r") as f:
    print(f.read())

The script was run under the following Polars versions. Each version number is followed by the serialized expression it produced.

0.20.21
{"Alias":[{"Ternary":{"predicate":{"Function":{"input":[{"Function":{"input":["Wildcard"],"function":{"Boolean":"IsNull"},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}}],"function":{"Boolean":"AnyHorizontal"},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":true,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":true,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}},"truthy":{"Literal":{"Int32":-1}},"falsy":{"BinaryExpr":{"left":{"Literal":{"Int32":100}},"op":"Plus","right":{"Function":{"input":[{"Column":"input_column_1"},{"Literal":{"Int32":1}}],"function":{"FillNull":{"super_type":"Unknown"}},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":true,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}}}}}},"output_column"]}

0.20.22
{"Alias":[{"Ternary":{"predicate":{"Function":{"input":[{"Function":{"input":["Wildcard"],"function":{"Boolean":"IsNull"},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}}],"function":{"Boolean":"AnyHorizontal"},"options":{"collect_groups":"GroupWise","fmt_str":"","input_wildcard_expansion":true,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}},"truthy":{"Literal":{"Int32":-1}},"falsy":{"BinaryExpr":{"left":{"Literal":{"Int32":100}},"op":"Plus","right":{"Function":{"input":[{"Column":"input_column_1"},{"Literal":{"Int32":1}}],"function":{"FillNull":{"super_type":"Unknown"}},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":true,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}}}}}},"output_column"]}

0.20.23
{"Alias":[{"Ternary":{"predicate":{"Function":{"input":[{"Function":{"input":["Wildcard"],"function":{"Boolean":"IsNull"},"options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}}],"function":{"Boolean":"AnyHorizontal"},"options":{"collect_groups":"GroupWise","fmt_str":"","input_wildcard_expansion":true,"returns_scalar":false,"cast_to_supertypes":false,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}},"truthy":{"Literal":{"Int":-1}},"falsy":{"BinaryExpr":{"left":{"Literal":{"Int":100}},"op":"Plus","right":{"Function":{"input":[{"Column":"input_column_1"},{"Literal":{"Int":1}}],"function":"FillNull","options":{"collect_groups":"ElementWise","fmt_str":"","input_wildcard_expansion":false,"returns_scalar":false,"cast_to_supertypes":true,"allow_rename":false,"pass_name_to_apply":false,"changes_length":false,"check_lengths":true,"allow_group_aware":true}}}}}}},"output_column"]}

And here is the simplified code used to attempt to deserialize these JSON serializations:

import polars as pl

print(pl.__version__)

filename = "expr_0_20_21.json"
# filename = "expr_0_20_22.json"
# filename = "expr_0_20_23.json"

expr = pl.Expr.deserialize(filename)
print(expr)

Log output

Traceback (most recent call last):
  File "deserialize.py", line 7, in <module>
    expr = pl.Expr.deserialize(filename)
  File "[redacted]/env-polars/lib/python3.8/site-packages/polars/expr/expr.py", line 373, in deserialize
    expr._pyexpr = PyExpr.deserialize(source)
polars.exceptions.ComputeError: could not deserialize input into an expression

Issue description

The JSON serialization of the same Python Polars expression differs between Polars 0.20.21, 0.20.22 and 0.20.23. I would like to know whether this is intended, because these are PATCH releases (semver) and the change is breaking: JSON serialized in 0.20.21 or 0.20.22 raises an error when deserialized in 0.20.23, and vice versa. The same problem occurs with the latest version, 0.20.31.

The differences between the 0.20.21 and 0.20.22 outputs do not break deserialization: the expression serialized in 0.20.21 can still be deserialized in 0.20.22, and vice versa. In the outputs above, the only changes are in the options of the AnyHorizontal function ("collect_groups" goes from "ElementWise" to "GroupWise", and "allow_rename" goes from true to false).


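The differences between the 0.20.22 and 0.20.23 outputs are more critical. In the outputs above, integer literals change from {"Int32": ...} to {"Int": ...}, and the fill_null function changes from {"FillNull":{"super_type":"Unknown"}} to the plain string "FillNull". Deserializing either output in the other version fails with the following exception: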

polars.exceptions.ComputeError: could not deserialize input into an expression
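
For anyone comparing their own files, the structural differences between two serializations can be listed with a small standard-library script. This is a minimal sketch and not part of Polars: `json_diff` and `MISSING` are hypothetical names, and the two fragments are abridged from the 0.20.22 and 0.20.23 outputs above (the `options` object is omitted).

```python
import json

# Placeholder reported when a key exists on only one side (hypothetical name).
MISSING = "<missing>"


def json_diff(old, new, path="$"):
    """Yield (path, old_value, new_value) wherever two parsed JSON trees differ."""
    if isinstance(old, dict) and isinstance(new, dict):
        # Walk the union of keys so additions and removals are both reported.
        for key in sorted(old.keys() | new.keys()):
            yield from json_diff(
                old.get(key, MISSING), new.get(key, MISSING), f"{path}.{key}"
            )
    elif isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        for index, (a, b) in enumerate(zip(old, new)):
            yield from json_diff(a, b, f"{path}[{index}]")
    elif old != new:
        yield (path, old, new)


# Abridged fill_null fragment as serialized by 0.20.22.
old_fragment = json.loads(
    '{"input":[{"Column":"input_column_1"},{"Literal":{"Int32":1}}],'
    '"function":{"FillNull":{"super_type":"Unknown"}}}'
)
# The same fragment as serialized by 0.20.23.
new_fragment = json.loads(
    '{"input":[{"Column":"input_column_1"},{"Literal":{"Int":1}}],'
    '"function":"FillNull"}'
)

changes = list(json_diff(old_fragment, new_fragment))
for path, before, after in changes:
    print(f"{path}: {before!r} -> {after!r}")
```

To compare complete files, load each one with `json.load` and pass the two results to `json_diff`.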


Given this, I'd like to know whether this behavior is intended. I'd also like to confirm whether expressions serialized in Python Polars are safe to deserialize and use in Rust Polars in a production environment. If so, is there a compatibility matrix between the Python and Rust releases? They appear to be versioned differently: as of 2024-06-07, the latest version of Polars on PyPI is 0.20.31, while the latest version for Rust is 0.40.0.

Expected behavior

I expected the JSON serialization of the expression to be identical across all of the versions mentioned (0.20.21 through 0.20.23), and each version to be able to deserialize the output of the others.
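
As long as that compatibility is not guaranteed, one defensive option is to store the producing Polars version next to the serialized expression and refuse to load it under a different version. The following is a minimal sketch that assumes the JSON text has already been obtained from Polars. `wrap_expr_json` and `unwrap_expr_json` are hypothetical helpers, not Polars APIs, and the exact-match rule is deliberately stricter than necessary (the 0.20.21 and 0.20.22 outputs were interchangeable in the tests described above).

```python
import json


def wrap_expr_json(expr_json: str, polars_version: str) -> str:
    """Bundle serialized expression JSON with the version that produced it."""
    return json.dumps(
        {"polars_version": polars_version, "expr": json.loads(expr_json)}
    )


def unwrap_expr_json(envelope_json: str, running_version: str) -> str:
    """Return the expression JSON, or raise if it came from another version."""
    envelope = json.loads(envelope_json)
    written_by = envelope["polars_version"]
    if written_by != running_version:
        raise ValueError(
            f"expression was serialized by Polars {written_by}, "
            f"but Polars {running_version} is running"
        )
    return json.dumps(envelope["expr"])


# Stand-in payload; real input would be the JSON text produced by Polars.
payload = wrap_expr_json('{"Column":"input_column_1"}', "0.20.22")
print(unwrap_expr_json(payload, "0.20.22"))
```

In a real script the version string would come from `pl.__version__` on both the writing and the reading side, so a mismatch fails with a clear message instead of the generic ComputeError.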

Installed versions

As mentioned above, this issue involves several versions of the library (0.20.21, 0.20.22 and 0.20.23). All of them were run in the environment pasted below; the only difference is the version of Polars itself.

--------Version info---------
Polars:               0.20.23
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.29
Python:               3.8.10 (default, Mar 15 2022, 12:22:08) 
[GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               <not installed>
matplotlib:           <not installed>
nest_asyncio:         <not installed>
numpy:                <not installed>
openpyxl:             <not installed>
pandas:               <not installed>
pyarrow:              <not installed>
pydantic:             <not installed>
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@hiideaki hiideaki added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Jun 8, 2024
@deanm0000
Collaborator

The Python and Rust version numbers are definitely not the same. I think the closest thing to a compatibility matrix is the release history: note which Rust version was the latest at the time of each Python release. In this example, it looks like the Rust version was 0.39.0 for Python 0.20.21, and 0.39.2 for Python 0.20.22 and 0.20.23.

It would be nice if deserialization were backward compatible, although that's more of an enhancement, and a challenging one too, I think. It might be easier, and better for binary size, to have a separate tool that can "upgrade" a serialization from one version to the next, rather than relying on the core library to interpret serializations from any old version. @MarcoGorelli I know you have a tool that updates code from one version to the next; any thoughts on this (not that I'm trying to imply that this is in scope for that effort)?
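
To make the "upgrade" idea concrete, here is a rough sketch of what such a separate tool could look like for the 0.20.22 to 0.20.23 step. It only rewrites the two shapes that differ in the example from this issue, so it is an assumption-laden illustration rather than a complete or verified migration: `upgrade_0_20_22_json` is a hypothetical name, the input is an abridged fragment without `options`, and treating every `Int32` literal as `Int` is inferred from this one example (an explicitly typed Int32 literal may well need to stay as it is).

```python
import json


def upgrade_0_20_22_json(node):
    """Recursively rewrite the two JSON shapes seen to change in this issue."""
    if isinstance(node, list):
        return [upgrade_0_20_22_json(child) for child in node]
    if isinstance(node, dict):
        # {"FillNull": {"super_type": ...}} became the bare string "FillNull".
        if set(node) == {"FillNull"} and isinstance(node["FillNull"], dict):
            return "FillNull"
        # {"Literal": {"Int32": n}} became {"Literal": {"Int": n}} (assumption:
        # inferred from untyped Python integer literals only).
        literal = node.get("Literal")
        if set(node) == {"Literal"} and isinstance(literal, dict) and set(literal) == {"Int32"}:
            return {"Literal": {"Int": literal["Int32"]}}
        return {key: upgrade_0_20_22_json(value) for key, value in node.items()}
    return node


# Abridged "falsy" branch as serialized by 0.20.22.
old = json.loads(
    '{"BinaryExpr":{"left":{"Literal":{"Int32":100}},"op":"Plus",'
    '"right":{"Function":{"input":[{"Column":"input_column_1"},'
    '{"Literal":{"Int32":1}}],"function":{"FillNull":{"super_type":"Unknown"}}}}}}'
)

upgraded = upgrade_0_20_22_json(old)
print(json.dumps(upgraded, separators=(",", ":")))
```

A real tool would need one such step per pair of adjacent releases, each derived from the actual schema change rather than from observed output.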

@deanm0000 deanm0000 added enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer and removed bug Something isn't working needs triage Awaiting prioritization by a maintainer labels Jun 10, 2024