Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

Open
1 of 3 tasks
gwerbin opened this issue May 31, 2024 · 6 comments
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@gwerbin
Copy link

gwerbin commented May 31, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

I think there are some related issues on this topic, but none that clearly describes the particular use case I have in mind.

Consider data frame P with multi-index columns a, b, c and data frame Q with multi-index columns a, b.

Given some subset of rows from Q, I want to be able to use its .index to subset rows from P, without a lot of intermediate processing or boilerplate code.

In relational terms, this is a semi-join operation between P and the subset of rows from Q.

Feature Description

Setup:

P = pd.DataFrame(..., index=pd.MultiIndex(..., names=["a", "b", "c"]))
Q = pd.DataFrame(..., index=pd.MultiIndex(..., names=["a", "b"]))

assert not P.index.duplicated.any()
assert not Q.index.duplicated.any()

Q_sub = Q.head()

Desired functionality

Any of these, or something similarly convenient:

P_sub = P.loc[Q_sub.index]

P_sub = P.xs(Q_sub.index)

P_sub = P.xs(Q_sub.index, level=Q_sub.index.names)

P_sub = P.join_semi(Q_sub, how="semi")

Alternative Solutions

Something like this, I think?

P_sub = P.loc[pd.IndexSlice[Q_sub.index.get_level_values("a"), Q_sub.index.get_level_values("b"), :]]

In the case when Q_sub is exactly 1 row, we can also use .xs:

P_sub = P.xs(Q_sub.index[0], level=Q_sub.index.names)

Additional Context

Maybe related to #4036 and #3057

@gwerbin gwerbin added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 31, 2024
@gwerbin gwerbin changed the title ENH: Support for subsetting a multi-index dataframe with a subset of the multi-index ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) May 31, 2024
@samukweku
Copy link
Contributor

@gwerbin Can you share a reproducible example?

@rhshadrach
Copy link
Member

Seems to me this is P.join(Q, how="right"), but as @samukweku said - please provide a reproducible example with the input and desired output.

@rhshadrach rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2024
@gwerbin
Copy link
Author

gwerbin commented Jun 3, 2024

It's not exactly the same as the right join because I'm not interested in any of the columns from Q, only subsetting indices. But I realize now that you could write it like this, which is a lot tidier than what I was doing before:

P.join(Q.loc[:, []], how="right")

I also had different behavior in mind for if there were duplicates in Q.index than what right join offers, but that wasn't part of my original example.

Maybe this is a good enough recipe to warrant not adding new functionality to the Pandas interface, but it's not obvious and IMO would look like opaque magic to a typical non-expert user.

The non-duplicate case

P = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 8, "u", -1.60),
        (2, 8, "v", -1.65),
        (1, 8, "u", -1.50),
        (1, 8, "v", -1.55),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

Q = pd.DataFrame(
    data=[
        (1, 9, 2.5),
        (2, 7, 3.5),
    ],
    columns=["a", "b", "y"],
).set_index(["a", "b"])

expected = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

pd.testing.assert_frame_equal(
    P.join(Q.loc[:, []], how="right"),
    expected,
)

The duplicate case

I didn't want to complicate things by including this example, but this is where what I want diverges from a typical right join.

Inputs:

P = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 8, "u", -1.60),
        (2, 8, "v", -1.65),
        (1, 8, "u", -1.50),
        (1, 8, "v", -1.55),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

Q = pd.DataFrame(
    data=[
        (1, 9, 2.5),
        (2, 7, 3.5),
        (2, 7, 4.5),
    ],
    columns=["a", "b", "y"],
).set_index(["a", "b"])

expected = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

pd.testing.assert_frame_equal(
    P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right"),
    expected,
)

Prior art

Note that Polars supports how="semi" join in the polars.DataFrame.join method: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.join.html

They also support the similarly-useful how="anti" join, which was also not part of the original intended scope, but would IMO be useful in Pandas for similar reasons.

@rhshadrach
Copy link
Member

But I realize now that you could write it like this, which is a lot tidier than what I was doing before:

P.join(Q.loc[:, []], how="right")

It can be slightly simpler with P.join(Q[[]], how="right").

Maybe this is a good enough recipe to warrant not adding new functionality to the Pandas interface, but it's not obvious and IMO would look like opaque magic to a typical non-expert user.

Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.

@rhshadrach
Copy link
Member

P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right")

Sorry - I missed this but my position is the same. pandas provides all the tools for you to accomplish the computation in a reasonable line of code.

@rhshadrach rhshadrach added Closing Candidate May be closeable, needs more eyeballs and removed Needs Info Clarification about behavior needed to assess issue labels Jun 4, 2024
@ottothecow
Copy link

ottothecow commented Oct 5, 2024

Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.

Are "filtering joins" not a common use case? Semis along with the matching anti-joins? Anti-join is the one that bothers me more though as it requires more code to replicate (and I use it very often).

  • Pyspark/spark implements left semi, right semi, left anti, and right anti.
  • Dplyr implements semi and anti joins (only left versions, but that's really all you need)
  • Polars implements them as @gwerbin mentioned.
  • SQL engines sort of implement them as EXISTS and NOT EXISTS (and talk about semi/anti conceptually in their docs), although their functionality is more trivial to implement in SQL syntax (SELECT left.* and either inner join or exclude where right.join_var IS NULL)
  • Julia implements left semi and left anti.

Personally, I always find it somewhat odd that the filtering join types are absent in Pandas. A simple and explicit syntax for the filtering joins seems like it would be beneficial here.

edit: and it is a further complication, but as @gwerbin mentioned, usually semi/anti joins are "safe" from multiple-merge issues--since the function knows you are only filtering your data it won't duplicate rows in the left where there are multiple matching results on the right. In my opinion, this is one of the big values to using them as it saves effort and makes your code more robust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Closing Candidate May be closeable, needs more eyeballs Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

No branches or pull requests

4 participants