List min/max operations on List[Enum] are returning Strings #19269

Closed
2 tasks done
etrotta opened this issue Oct 16, 2024 · 0 comments · Fixed by #19301
Labels: accepted (Ready for implementation), bug (Something isn't working), needs triage (Awaiting prioritization by a maintainer), python (Related to Python Polars)

etrotta commented Oct 16, 2024

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
en = pl.Enum(['X', 'Z', 'Y'])
df = pl.DataFrame({'test': pl.Series(['X', 'Y', 'Z'], dtype=en), 'group': [1, 2, 2]})
print(df.group_by('group').agg(pl.col('test').mode()).select(pl.col('test').list.max()))

# The result has the wrong dtype. Other list operations, such as `.first()` instead of `.max()`, return the right dtype.
print(df.group_by('group').agg(pl.col('test').mode()).select(pl.col('test').list.first()))

# The max() operation does use the correct Enum ordering, though. You can verify this by comparing the result against a plain String column:
df = pl.DataFrame({'test': pl.Series(['X', 'Y', 'Z'], dtype=str), 'group': [1, 2, 2]})
print(df.group_by('group').agg(pl.col('test').mode()).select(pl.col('test').list.max()))

Log output

stderr:
keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION
keys/aggregates are not partitionable: running default HASH AGGREGATION

stdout:
# list[enum] .list.max() ; correct values, wrong dtype
shape: (2, 1)
┌──────┐
│ test │
│ ---  │
│ str  │
╞══════╡
│ Y    │
│ X    │
└──────┘
# list[enum] .list.first() ; correct dtype for reference
shape: (2, 1)
┌──────┐
│ test │
│ ---  │
│ enum │
╞══════╡
│ X    │
│ Y    │
└──────┘
# list[str] list.max() ; just for reference to check that it is actually different from list[enum]'s .max()
shape: (2, 1)
┌──────┐
│ test │
│ ---  │
│ str  │
╞══════╡
│ X    │
│ Z    │
└──────┘

Issue description

In lazy mode, collect_schema() also ends up with a result different from collect().schema:

>>> df.lazy().group_by('group').agg(pl.col('test').mode()).select(pl.col('test').list.max()).collect_schema()
Schema({'test': Enum(categories=['X', 'Z', 'Y'])})
>>> df.lazy().group_by('group').agg(pl.col('test').mode()).select(pl.col('test').list.max()).collect().schema
keys/aggregates are not partitionable: running default HASH AGGREGATION
Schema({'test': String})

Might be related to #18394

Expected behavior

The .list.max() operation applied to a list[Enum] column should return a Series of dtype Enum rather than str.

Installed versions

--------Version info---------
Polars:              1.9.0
Index type:          UInt32
Platform:            Windows-11-10.0.22631-SP0
Python:              3.12.0 (tags/v3.12.0:0fb18b0, Oct  2 2023, 13:03:39) [MSC v.1935 64 bit (AMD64)]

----Optional dependencies----
adbc_driver_manager  <not installed>
altair               <not installed>
cloudpickle          <not installed>
connectorx           <not installed>
deltalake            <not installed>
fastexcel            0.9.0
fsspec               <not installed>
gevent               <not installed>
great_tables         <not installed>
matplotlib           3.8.2
nest_asyncio         <not installed>
numpy                1.26.4
openpyxl             3.1.2
pandas               2.2.2
pyarrow              15.0.0
pydantic             2.5.2
pyiceberg            <not installed>
sqlalchemy           2.0.29
torch                <not installed>
xlsx2csv             <not installed>
xlsxwriter           <not installed>
etrotta added the bug, needs triage, and python labels on Oct 16, 2024
c-peters added the accepted label on Oct 21, 2024
Status: Done