ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

gwerbin · 2024-05-31T14:33:36Z

Feature Type

Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas

Problem Description

I think there are some related issues on this topic, but none that clearly describes the particular use case I have in mind.

Consider data frame P with multi-index columns a, b, c and data frame Q with multi-index columns a, b.

Given some subset of rows from Q, I want to be able to use its .index to subset rows from P, without a lot of intermediate processing or boilerplate code.

In relational terms, this is a semi-join operation between P and the subset of rows from Q.

Feature Description

Setup:

P = pd.DataFrame(..., index=pd.MultiIndex(..., names=["a", "b", "c"]))
Q = pd.DataFrame(..., index=pd.MultiIndex(..., names=["a", "b"]))

assert not P.index.duplicated.any()
assert not Q.index.duplicated.any()

Q_sub = Q.head()

Desired functionality

Any of these, or something similarly convenient:

P_sub = P.loc[Q_sub.index]

P_sub = P.xs(Q_sub.index)

P_sub = P.xs(Q_sub.index, level=Q_sub.index.names)

P_sub = P.join_semi(Q_sub, how="semi")

Alternative Solutions

Something like this, I think?

P_sub = P.loc[pd.IndexSlice[Q_sub.index.get_level_values("a"), Q_sub.index.get_level_values("b"), :]]

In the case when Q_sub is exactly 1 row, we can also use .xs:

P_sub = P.xs(Q_sub.index[0], level=Q_sub.index.names)

Additional Context

Maybe related to #4036 and #3057

The text was updated successfully, but these errors were encountered:

samukweku · 2024-05-31T21:44:16Z

@gwerbin Can you share a reproducible example?

rhshadrach · 2024-06-02T13:05:51Z

Seems to me this is P.join(Q, how="right"), but as @samukweku said - please provide a reproducible example with the input and desired output.

gwerbin · 2024-06-03T22:37:17Z

It's not exactly the same as the right join because I'm not interested in any of the columns from Q, only subsetting indices. But I realize now that you could write it like this, which is a lot tidier than what I was doing before:

P.join(Q.loc[:, []], how="right")

I also had different behavior in mind for if there were duplicates in Q.index than what right join offers, but that wasn't part of my original example.

Maybe this is a good enough recipe to warrant not adding new functionality to the Pandas interface, but it's not obvious and IMO would look like opaque magic to a typical non-expert user.

The non-duplicate case

P = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 8, "u", -1.60),
        (2, 8, "v", -1.65),
        (1, 8, "u", -1.50),
        (1, 8, "v", -1.55),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

Q = pd.DataFrame(
    data=[
        (1, 9, 2.5),
        (2, 7, 3.5),
    ],
    columns=["a", "b", "y"],
).set_index(["a", "b"])

expected = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

pd.testing.assert_frame_equal(
    P.join(Q.loc[:, []], how="right"),
    expected,
)

The duplicate case

I didn't want to complicate things by including this example, but this is where what I want diverges from a typical right join.

Inputs:

P = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 8, "u", -1.60),
        (2, 8, "v", -1.65),
        (1, 8, "u", -1.50),
        (1, 8, "v", -1.55),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

Q = pd.DataFrame(
    data=[
        (1, 9, 2.5),
        (2, 7, 3.5),
        (2, 7, 4.5),
    ],
    columns=["a", "b", "y"],
).set_index(["a", "b"])

expected = pd.DataFrame(
    data=[
        (1, 9, "u", -1.70),
        (1, 9, "v", -1.75),
        (2, 7, "u", -1.40),
        (2, 7, "v", -1.45),
    ],
    columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])

pd.testing.assert_frame_equal(
    P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right"),
    expected,
)

Prior art

Note that Polars supports how="semi" join in the polars.DataFrame.join method: https://docs.pola.rs/py-polars/html/reference/dataframe/api/polars.DataFrame.join.html

They also support the similarly-useful how="anti" join, which was also not part of the original intended scope, but would IMO be useful in Pandas for similar reasons.

rhshadrach · 2024-06-04T22:04:36Z

But I realize now that you could write it like this, which is a lot tidier than what I was doing before:
P.join(Q.loc[:, []], how="right")

It can be slightly simpler with P.join(Q[[]], how="right").

Maybe this is a good enough recipe to warrant not adding new functionality to the Pandas interface, but it's not obvious and IMO would look like opaque magic to a typical non-expert user.

Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.

rhshadrach · 2024-06-04T22:07:02Z

P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right")

Sorry - I missed this but my position is the same. pandas provides all the tools for you to accomplish the computation in a reasonable line of code.

ottothecow · 2024-10-05T16:21:27Z

Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion.

Are "filtering joins" not a common use case? Semis along with the matching anti-joins? Anti-join is the one that bothers me more though as it requires more code to replicate (and I use it very often).

Pyspark/spark implements left semi, right semi, left anti, and right anti.
Dplyr implements semi and anti joins (only left versions, but that's really all you need)
Polars implements them as @gwerbin mentioned.
SQL engines sort of implement them as EXISTS and NOT EXISTS (and talk about semi/anti conceptually in their docs), although their functionality is more trivial to implement in SQL syntax (SELECT left.* and either inner join or exclude where right.join_var IS NULL)
Julia implements left semi and left anti.

Personally, I always find it somewhat odd that the filtering join types are absent in Pandas. A simple and explicit syntax for the filtering joins seems like it would be beneficial here.

edit: and it is a further complication, but as @gwerbin mentioned, usually semi/anti joins are "safe" from multiple-merge issues--since the function knows you are only filtering your data it won't duplicate rows in the left where there are multiple matching results on the right. In my opinion, this is one of the big values to using them as it saves effort and makes your code more robust.

gwerbin added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels May 31, 2024

gwerbin changed the title ~~ENH: Support for subsetting a multi-index dataframe with a subset of the multi-index~~ ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) May 31, 2024

rhshadrach added Reshaping Concat, Merge/Join, Stack/Unstack, Explode Needs Info Clarification about behavior needed to assess issue and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2024

rhshadrach added Closing Candidate May be closeable, needs more eyeballs and removed Needs Info Clarification about behavior needed to assess issue labels Jun 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

gwerbin commented May 31, 2024 •

edited

Loading

samukweku commented May 31, 2024

rhshadrach commented Jun 2, 2024

gwerbin commented Jun 3, 2024 •

edited

Loading

rhshadrach commented Jun 4, 2024

rhshadrach commented Jun 4, 2024

ottothecow commented Oct 5, 2024 •

edited

Loading

ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873

Comments

gwerbin commented May 31, 2024 • edited Loading

Feature Type

Problem Description

Feature Description

Desired functionality

Alternative Solutions

Additional Context

samukweku commented May 31, 2024

rhshadrach commented Jun 2, 2024

gwerbin commented Jun 3, 2024 • edited Loading

The non-duplicate case

The duplicate case

Prior art

rhshadrach commented Jun 4, 2024

rhshadrach commented Jun 4, 2024

ottothecow commented Oct 5, 2024 • edited Loading

gwerbin commented May 31, 2024 •

edited

Loading

gwerbin commented Jun 3, 2024 •

edited

Loading

ottothecow commented Oct 5, 2024 •

edited

Loading