pl.<ex-horizontal-operation>() gives unhelpful duplicate column error #11060

Open
mcrumiller opened this issue Sep 12, 2023 · 7 comments
Labels
A-exceptions Area: exception handling bug Something isn't working P-low Priority: low python Related to Python Polars

Comments

@mcrumiller
Contributor

mcrumiller commented Sep 12, 2023

Issue Description

import polars as pl

pl.DataFrame({'a': [1], 'b': [1]}).select(
    pl.sum('a', 'b').alias('c')
)
exceptions.DuplicateError: column with name 'c' has more than one occurrences

pl.sum() in this context has been replaced with pl.sum_horizontal, but this is the wrong error.

Expected behavior

pl.sum() now takes one argument, an expression. If it gets more than one, it should give an error indicating that the API has changed. The current error is incorrect.

@mcrumiller mcrumiller added bug Something isn't working python Related to Python Polars labels Sep 12, 2023
@reswqa
Collaborator

reswqa commented Sep 12, 2023

Since pl.sum() works like pl.col("a", "b").sum(), passing more than one column expands it into col("a").sum() and col("b").sum(). Adding an alias after that causes the problem, because both outputs end up with the same name. That is to say, this may not be a bug.
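
To make the expansion step concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Polars implements it: `expand` and `check_unique` are hypothetical helpers, and the error text is copied from the report above.

```python
# Toy model, NOT Polars internals: a multi-column expression is treated as a
# list of input column names, and an alias is applied to every expanded output.

def expand(columns, alias=None):
    """Return the output column names produced by the expanded expression."""
    return [alias if alias is not None else name for name in columns]


def check_unique(names):
    """Raise if any output name occurs more than once (message as in the report)."""
    seen = set()
    for name in names:
        if name in seen:
            raise ValueError(
                f"column with name '{name}' has more than one occurrences"
            )
        seen.add(name)
    return names


# Without an alias, each input column keeps its own name.
print(expand(["a", "b"]))

# With an alias, every expanded output is renamed to 'c', so the names collide.
outputs = expand(["a", "b"], "c")
print(outputs)
```

Passing `outputs` to `check_unique` raises the duplicate-name error, which matches the symptom in the original report.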

@mcrumiller
Contributor Author

I think the reason is that pl.sum() performs a sum on each column, so it's basically the equivalent of attempting to perform an alias on multiple columns, e.g.:

pl.DataFrame({'a': [1], 'b': [1]}).select(pl.all().alias('c'))
exceptions.DuplicateError: column with name 'c' has more than one occurrences

@mcrumiller mcrumiller changed the title pl.sum() gives wrong duplicate column error pl.<ex-horizontal-operation>() gives unhelpful duplicate column error Sep 12, 2023
@mcrumiller
Contributor Author

I encountered this error in my codebase multiple times, all with pl.sum, pl.any, pl.all, etc. These have all since been replaced by pl.*_horizontal and I think we should give a more informative error when it's applied to multiple columns.

@reswqa
Collaborator

reswqa commented Sep 12, 2023

These have all since been replaced by pl.*_horizontal

I'm a bit confused about this. Shouldn't pl.sum("a", "b") and pl.sum_horizontal("a", "b") behave differently (one maps to shape (1, 2), the other to shape (1, 1))? What does "replace" mean here 🤔
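
The two shapes mentioned here can be illustrated without Polars. This is a plain-Python sketch under the assumption that a frame is a dict of equal-length lists; `sum_vertical` and `sum_horizontal` are stand-in functions written for this example, not the library's implementations.

```python
# Plain-Python illustration; `data` mirrors the frame from the original report.
data = {"a": [1], "b": [1]}


def sum_vertical(frame, *names):
    """Sum each named column separately: one output column per input name."""
    return {name: [sum(frame[name])] for name in names}


def sum_horizontal(frame, *names):
    """Sum across the named columns row by row: a single output column."""
    return [sum(row) for row in zip(*(frame[name] for name in names))]


vertical = sum_vertical(data, "a", "b")      # two columns, one row each
horizontal = sum_horizontal(data, "a", "b")  # one column, one row

print(vertical)
print(horizontal)
```

The column-wise version keeps two outputs, which is why a single alias cannot be applied to it, while the row-wise version yields one output that can be renamed freely.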

@reswqa
Collaborator

reswqa commented Sep 12, 2023

One possible improvement: when an alias is applied to an expression that expands to multiple columns (e.g. pl.all(), cs.numeric()), we should give a more informative error message.
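
A rough sketch of what such a check could look like, written in plain Python rather than against the Polars codebase. Everything here is an assumption made for illustration: `MultiOutputAliasError`, `apply_alias`, and the message wording are hypothetical, and where such a check would actually live in Polars is not addressed.

```python
class MultiOutputAliasError(ValueError):
    """Hypothetical error type for this sketch; not part of Polars."""


def apply_alias(expanded_columns, alias):
    """Rename the outputs of an expanded expression, refusing ambiguous aliases.

    `expanded_columns` is the list of column names the expression expands to.
    """
    if len(expanded_columns) > 1:
        # Fail early with a message that names the real cause, instead of
        # letting the duplicate-name check fire later.
        raise MultiOutputAliasError(
            f"cannot alias {len(expanded_columns)} columns "
            f"{expanded_columns} to the single name '{alias}'; "
            "use a horizontal operation or alias each column separately"
        )
    return [alias for _ in expanded_columns]


print(apply_alias(["a"], "c"))  # a single output can be renamed safely
```

Calling `apply_alias(["a", "b"], "c")` raises the hypothetical error before any duplicate-name check runs, so the user sees the cause rather than the symptom.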

@cmdlineluser
Contributor

Another one I noticed is the "could not convert value 'Unknown' as a Literal" message, which seems a bit cryptic:

import polars as pl

df = pl.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

df.select(pl.all(pl.all() > 5))

# ValueError: could not convert value 'Unknown' as a Literal

@stinodego
Member

One possible improvement: when an alias is applied to an expression that expands to multiple columns (e.g. pl.all(), cs.numeric()), we should give a more informative error message.

This would be the desired solution here. Not sure if it's easy to implement though.

@stinodego stinodego added P-low Priority: low A-exceptions Area: exception handling and removed needs triage Awaiting prioritization by a maintainer labels Jan 23, 2024