ENH: allow preserving one of the indexes when merging two DataFrames #46882

Open
multimeric opened this issue Apr 27, 2022 · 4 comments
Labels
Enhancement Needs Discussion Requires discussion from core team before further action Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@multimeric
Contributor

Is your feature request related to a problem?

I want to be able to merge two DataFrames, but keep the index of the left one in the final result:

>>> import pandas as pd
>>> import string
>>> df1 = pd.DataFrame({"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5]))
>>> df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})
>>> df1
   a   b
a  0  10
b  1  11
c  2  12
d  3  13
e  4  14
>>> df2
   a  c
0  0  A
1  1  B
2  2  C
3  3  D
4  4  E

The current merge behaviour is to just drop the index entirely:

>>> df1.merge(df2, on="a")
   a   b  c
0  0  10  A
1  1  11  B
2  2  12  C
3  3  13  D
4  4  14  E

Describe the solution you'd like

Add a new parameter, preserve_index, to merge. It accepts "left", "right", or None (the default, which keeps the current behaviour):

DataFrame.merge(preserve_index="left")

In my above example, this would work like:

>>> df1.merge(df2, on="a", preserve_index="left")
   a   b  c
a  0  10  A
b  1  11  B
c  2  12  C
d  3  13  D
e  4  14  E
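
To make the intended semantics concrete, here is a minimal sketch of a helper that emulates the proposal with existing pandas calls. The function name merge_preserving_index and the temporary column name are hypothetical and not part of pandas; the sketch assumes a single-level index and a merge in which every result row has a match on the preserved side (with how="outer", unmatched rows would get NaN index labels).

```python
import string

import pandas as pd


def merge_preserving_index(left, right, preserve_index=None, **merge_kwargs):
    """Hypothetical helper (not a pandas API) emulating the proposed parameter."""
    if preserve_index is None:
        # Same as today's behaviour: the result gets a fresh RangeIndex.
        return left.merge(right, **merge_kwargs)
    if preserve_index not in ("left", "right"):
        raise ValueError("preserve_index must be 'left', 'right', or None")

    # Temporary column name; assumed not to clash with any real column.
    tmp = "__preserved_index__"
    source = left if preserve_index == "left" else right
    # Carry the chosen index through the merge as an ordinary column.
    tagged = source.assign(**{tmp: source.index})
    if preserve_index == "left":
        merged = tagged.merge(right, **merge_kwargs)
    else:
        merged = left.merge(tagged, **merge_kwargs)

    # Promote the carried column back to the index and restore its name.
    merged = merged.set_index(tmp)
    merged.index.name = source.index.name
    return merged


df1 = pd.DataFrame(
    {"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5])
)
df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})

result_left = merge_preserving_index(df1, df2, preserve_index="left", on="a")
result_right = merge_preserving_index(df1, df2, preserve_index="right", on="a")
```
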

API breaking implications

None. This is a new parameter, and if it is not provided the API is identical.

Describe alternatives you've considered

It is already possible to work around this by resetting the index before the merge and then setting it as the index again afterwards, as described here, but this is:

  • More verbose
  • Not intuitive or clear to users (hence the StackOverflow question's popularity)
  • Probably less efficient
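
For reference, the workaround might look like the sketch below, using the example frames from above. It relies on reset_index() naming the new column "index" because df1's index is unnamed; a named index would produce a column with that name instead.

```python
import string

import pandas as pd

df1 = pd.DataFrame(
    {"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5])
)
df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})

merged = (
    df1.reset_index()      # move the left index into a column named "index"
    .merge(df2, on="a")    # merge as usual; the column survives the merge
    .set_index("index")    # promote the column back to the index
    .rename_axis(None)     # drop the leftover "index" name
)
```
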
@multimeric multimeric added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 27, 2022
@attack68
Contributor

attack68 commented Apr 29, 2022

Isn't it just as easy to use df1.merge(df2, on="a").set_index("a")?
Otherwise we risk introducing features that need to be maintained and tested through further development when these methods already exist.

edit:
Now I see the end of your post. OK, but I'm -1 on this.

@multimeric
Contributor Author

multimeric commented Apr 29, 2022

You also have to reset the index to ensure it's a column, and I think the three points above show enough merit to make this worthwhile. A chain of 3 methods versus one method and one parameter is a big improvement.
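
A short sketch of the difference, again using the example frames from the original post: setting the index to the key column "a" after the merge yields the key values as labels, while the original labels are only kept if the index is turned into a column before merging.

```python
import string

import pandas as pd

df1 = pd.DataFrame(
    {"a": range(5), "b": range(10, 15)}, index=list(string.ascii_lowercase[:5])
)
df2 = pd.DataFrame({"a": range(5), "c": list(string.ascii_uppercase[:5])})

# Two-step chain: the index becomes the values of column "a" (0..4),
# and the original labels of df1 are lost.
by_key = df1.merge(df2, on="a").set_index("a")

# Three-step chain: the original labels of df1 ("a".."e") are kept.
by_label = df1.reset_index().merge(df2, on="a").set_index("index")
```
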

@simonjayhawkins simonjayhawkins added Reshaping Concat, Merge/Join, Stack/Unstack, Explode and removed Needs Triage Issue that has not been reviewed by a pandas team member labels May 4, 2022
@simonjayhawkins simonjayhawkins added this to the Contributions Welcome milestone May 4, 2022
@Mehgarg
Contributor

Mehgarg commented Jul 15, 2022

take

@Mehgarg Mehgarg mentioned this issue Aug 4, 2022
5 tasks
@attack68
Contributor

attack68 commented Aug 5, 2022

@multimeric it's fair to give a full response on this since you raise sensible points.

The pandas API is large (too large). My general approach is to not add any args / methods that perform functions that can already be performed. In fact I am in favour of selectively removing / reducing args when multiple ways of performing tasks exist. And my PRs reflect this philosophy.

Probably less efficient

In the long run this has the advantage of making code more maintainable for developers, and likely improves performance since those core methods can be optimised for general tasks as opposed to optimising selective and individual cases, or specific ways to handle args. This is important for the longevity, and future development of pandas.

More verbose

This is subjective. Personally I strive for an atomised code construction. In software development I prefer using core methods rather than subtle args to avoid the operational risk of arg deprecation.
merge and set_index are core methods so are unlikely to be restructured, so I would favour chaining these, especially where merge is such a complex method in terms of combinatorial challenges.

Not intuitive or clear to users

Fully agree. I think documenting use cases like this in the documentation and cookbooks is valuable, and we should work to provide better examples that users can copy, knowing that the pandas team is confident it is the "most efficient" way. This is a development item and something we need to do better.

Sorry I don't support your idea, hope you appreciate my feedback.

@mroeschke mroeschke added the Needs Discussion Requires discussion from core team before further action label Aug 8, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

5 participants