Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't disable coalesce when joining on expressions #18194

Open
mcrumiller opened this issue Aug 14, 2024 · 3 comments
Open

Don't disable coalesce when joining on expressions #18194

mcrumiller opened this issue Aug 14, 2024 · 3 comments
Labels
enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer

Comments

@mcrumiller
Copy link
Contributor

mcrumiller commented Aug 14, 2024

Description

The documentation for join states the following for the coalesce parameter:

coalesce

Coalescing behavior (merging of join columns).

  • None: -> join specific.
  • True: -> Always coalesce join columns.
  • False: -> Never coalesce join columns.

Note that joining on any other expressions than col will turn off coalescing.

I'm not sure why joining on other expressions disables coalescing. To get around this behavior, I find myself constantly using the following patterns:

df2 = df2.with_columns(
    col("name_1").operations().alias("tmp_name_1"),
    col("name_2").operations().alias("tmp_name_2"),
    col("name_3").operations().alias("tmp_name_3"),
)
df1.join(df2,
    left_on=["name_1", "name_2", "name_3"],
    right_on=["tmp_name_1", "tmp_name_2", "tmp_name_3"],
    how="left",
    coalesce=True,
).drop("tmp_name_1", "tmp_name_2", "tmp_name_3")

Instead, it would be preferable to be able to do:

df1.join(df2,
    left_on=["name_1", "name_2", "name_3"],
    right_on=[col("name_1").operations(), col("name_2").operations(), col("name_3").operations()],
    how="left",
    coalesce=True
)
@mcrumiller mcrumiller added the enhancement New feature or an improvement of an existing feature label Aug 14, 2024
@deanm0000
Copy link
Collaborator

deanm0000 commented Aug 14, 2024

One small tweak to your workaround

df2 = df2.with_columns(
    expr1.alias("name_1"),
    expr2.alias("name_2"),
    expr3.alias("name_3"),
)
df1.join(df2,
    on=["name_1", "name_2", "name_3"],
    how="left",
    coalesce=True,
)

I'm guessing that it was an implementation difficulty that made it the way it is rather than a planned decision. Maybe to do with handling aliases.

@deanm0000 deanm0000 added the needs decision Awaiting decision by a maintainer label Aug 14, 2024
@mcrumiller
Copy link
Contributor Author

mcrumiller commented Aug 14, 2024

@deanm0000 the problem is that name_1, name_2, and name_3 already exist in df2, but I need to do an operation on them, for example a .shift, so I cannot re-alias to the same name.

I updated my example above to make that a bit clearer.

@ritchie46
Copy link
Member

It adds a lot of complexity and I don't think I am not sure it makes sense either.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer
Projects
None yet
Development

No branches or pull requests

3 participants