Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

descending arg looks backwards in top_k #15236

Closed
2 tasks done
MarcoGorelli opened this issue Mar 22, 2024 · 9 comments · Fixed by #16817
Closed
2 tasks done

descending arg looks backwards in top_k #15236

MarcoGorelli opened this issue Mar 22, 2024 · 9 comments · Fixed by #16817
Assignees
Labels
A-api Area: changes to the public API accepted Ready for implementation
Milestone

Comments

@MarcoGorelli
Copy link
Collaborator

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

This came up here: https://discord.com/channels/908022250106667068/1082941945216782397/1220375752428753027

import polars as pl
df = pl.DataFrame({'a': [1,3,2]})
df.top_k(k=1, by='a', descending=False)

Log output

shape: (1, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
└─────┘

Issue description

If I read df.top_k(k=1, by='a', descending=False), I expect it to mean "get the top 1 smallest element". However, it actually finds the top 1 largest element (contradicting descending=False?)

Conversely, df.top_k(k=1, by='a', descending=True) finds the smallest element

Expected behavior

Not sure, it requires more discussion. But for a start, there should be an issue about it :)

Polars used to have reverse arguments for sorting functions, but they were changed to descending. But in this case, reverse does feel quite natural

I don't really know what to suggest, but comments are very welcome

Installed versions

--------Version info---------
Polars:               0.20.16
Index type:           UInt32
Platform:             Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Python:               3.11.8 (main, Feb 25 2024, 16:39:33) [GCC 11.4.0]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               <not installed>
gevent:               <not installed>
hvplot:               0.9.2
matplotlib:           3.8.3
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.1
pyarrow:              15.0.1
pydantic:             2.6.1
pyiceberg:            <not installed>
pyxlsb:               <not installed>
sqlalchemy:           <not installed>
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@MarcoGorelli MarcoGorelli added bug Something isn't working python Related to Python Polars needs triage Awaiting prioritization by a maintainer needs decision Awaiting decision by a maintainer A-api Area: changes to the public API and removed needs triage Awaiting prioritization by a maintainer bug Something isn't working python Related to Python Polars labels Mar 22, 2024
@mcrumiller
Copy link
Contributor

mcrumiller commented Mar 22, 2024

Confusing in general. I would expect the descending parameter to change the sorting of the output, not what the "top" values are. We have a bottom_k() function for that already.

@MarcoGorelli
Copy link
Collaborator Author

not what the "top" values are. We have a bottom_k() function for that already.

you can pass multiple values to descending though, so bottom_k isn't a full replacement:

In [44]: df = pl.DataFrame({'a': [1, 2, 3], 'b': [6, 5, 4], 'c': [7,8,9]})

In [45]: df.top_k(k=2, by=['a', 'b'], descending=[True, False])
Out[45]:
shape: (2, 3)
┌─────┬─────┬─────┐
│ a   ┆ b   ┆ c   │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ 7   │
│ 2   ┆ 5   ┆ 8   │
└─────┴─────┴─────┘

@mcrumiller
Copy link
Contributor

I see. Seems like a single function called extrema would work better, but people will undoubtedly be searching for top_k and bottom_k since those are pretty prevalent in other libraries.

@MarcoGorelli
Copy link
Collaborator Author

Yup...which leaves open the question of "what to do with descending?". I don't know 😄

@mcrumiller
Copy link
Contributor

mcrumiller commented Mar 22, 2024

A somewhat more explicit but verbose would be a sort_method parameter that's a list of "ascending" and "descending":

import polars as pl

df = pl.DataFrame({
    "a": [1, 1, 2, 2, 3, 3],
    "b": [2, 1, 2, 1, 2, 1],
})

# (a=3, b=2)
df.top_k(k=1, by=pl.all())

# (a=3, b=1)
df.top_k(k=1, by=pl.all(), sort_method=["ascending", "descending"])

# (a=1, b=1)
df.bottom_k(k=1, by=pl.all()) 

# (a=1, b=2)
df.bottom_k(k=1, by=pl.all(), sort_method="descending", "ascending")

@mcrumiller
Copy link
Contributor

And actually, just renaming descending to sort_descending would be more obvious. And then the bottom_k parameter could be sort_ascending.

@MarcoGorelli
Copy link
Collaborator Author

MarcoGorelli commented Mar 22, 2024

I don't think it can just be renamed, because then the output of sort_descending=False actually would be sorted descending

I feel like suggesting find_smallest, so then code can just be updated without other changes?

Current behaviour:

In [46]: import polars as pl
    ...: df = pl.DataFrame({'a': [1,3,2]})
    ...: df.top_k(k=3, by='a', descending=False)
Out[46]:
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
│ 2   │
│ 1   │
└─────┘

Proposed new api:

In [46]: import polars as pl
    ...: df = pl.DataFrame({'a': [1,3,2]})
    ...: df.top_k(k=3, by='a', find_smallest=False)
Out[46]:
shape: (3, 1)
┌─────┐
│ a   │
│ --- │
│ i64 │
╞═════╡
│ 3   │
│ 2   │
│ 1   │
└─────┘

@mtomko
Copy link

mtomko commented Apr 4, 2024

I ran across this yesterday in a different context - it looks to me like the underlying Rust code is doing it backwards. Here's where the Python code calls into the Rust, if I understand it correctly:

self._ldf.top_k(k, by, descending, nulls_last, maintain_order)

To me it seems like the Python code reflects the Rust API and is faithfully passing descending through.

I don't like this behavior but it feels like the Python and Rust code should be kept consistent, so my temptation would be to look in to addressing the problem upstream of the Python bindings.

I'm personally not actually using Python at all except as an example (we're looking at the Scala JNI bindings to the Rust backend) but my coworkers would want to use the Python bindings. It would be excellent for us to be able to be on the same page with respect to the API shape.

@stinodego stinodego added accepted Ready for implementation and removed needs decision Awaiting decision by a maintainer labels Jun 7, 2024
@stinodego stinodego self-assigned this Jun 7, 2024
@stinodego
Copy link
Member

We will rename the parameter to reverse in top_k and bottom_k.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-api Area: changes to the public API accepted Ready for implementation
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants