-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Support for semi-join on index (subsetting a multi-index dataframe with a subset of the multi-index) #58873
Comments
@gwerbin Can you share a reproducible example? |
Seems to me this is |
It's not exactly the same as the right join because I'm not interested in any of the columns from P.join(Q.loc[:, []], how="right") I also had different behavior in mind for if there were duplicates in Maybe this is a good enough recipe to warrant not adding new functionality to the Pandas interface, but it's not obvious and IMO would look like opaque magic to a typical non-expert user. The non-duplicate caseP = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 8, "u", -1.60),
(2, 8, "v", -1.65),
(1, 8, "u", -1.50),
(1, 8, "v", -1.55),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
Q = pd.DataFrame(
data=[
(1, 9, 2.5),
(2, 7, 3.5),
],
columns=["a", "b", "y"],
).set_index(["a", "b"])
expected = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
pd.testing.assert_frame_equal(
P.join(Q.loc[:, []], how="right"),
expected,
) The duplicate caseI didn't want to complicate things by including this example, but this is where what I want diverges from a typical right join. Inputs: P = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 8, "u", -1.60),
(2, 8, "v", -1.65),
(1, 8, "u", -1.50),
(1, 8, "v", -1.55),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
Q = pd.DataFrame(
data=[
(1, 9, 2.5),
(2, 7, 3.5),
(2, 7, 4.5),
],
columns=["a", "b", "y"],
).set_index(["a", "b"])
expected = pd.DataFrame(
data=[
(1, 9, "u", -1.70),
(1, 9, "v", -1.75),
(2, 7, "u", -1.40),
(2, 7, "v", -1.45),
],
columns=["a", "b", "c", "x"],
).set_index(["a", "b", "c"])
pd.testing.assert_frame_equal(
P.join(Q.loc[~Q.index.duplicated(keep="first"), []], how="right"),
expected,
) Prior artNote that Polars supports They also support the similarly-useful |
It can be slightly simpler with
Adding a new method/arguments to pandas for cases like this is not sustainable in my opinion. |
Sorry - I missed this but my position is the same. pandas provides all the tools for you to accomplish the computation in a reasonable line of code. |
Are "filtering joins" not a common use case? Semis along with the matching anti-joins? Anti-join is the one that bothers me more though as it requires more code to replicate (and I use it very often).
Personally, I always find it somewhat odd that the filtering join types are absent in Pandas. A simple and explicit syntax for the filtering joins seems like it would be beneficial here. edit: and it is a further complication, but as @gwerbin mentioned, usually semi/anti joins are "safe" from multiple-merge issues--since the function knows you are only filtering your data it won't duplicate rows in the left where there are multiple matching results on the right. In my opinion, this is one of the big values to using them as it saves effort and makes your code more robust. |
Feature Type
Adding new functionality to pandas
Changing existing functionality in pandas
Removing existing functionality in pandas
Problem Description
I think there are some related issues on this topic, but none that clearly describes the particular use case I have in mind.
Consider data frame
P
with multi-index columnsa, b, c
and data frameQ
with multi-index columnsa, b
.Given some subset of rows from
Q
, I want to be able to use its.index
to subset rows fromP
, without a lot of intermediate processing or boilerplate code.In relational terms, this is a semi-join operation between P and the subset of rows from Q.
Feature Description
Setup:
Desired functionality
Any of these, or something similarly convenient:
Alternative Solutions
Something like this, I think?
In the case when
Q_sub
is exactly 1 row, we can also use.xs
:Additional Context
Maybe related to #4036 and #3057
The text was updated successfully, but these errors were encountered: