DOC: Key order after dataframe inner merge #53157

Closed
1 task done
lucdem opened this issue May 9, 2023 · 6 comments · Fixed by #54611

@lucdem

lucdem commented May 9, 2023

Pandas version checks

  • I have checked that the issue still exists on the latest versions of the docs on main here

Location of the documentation

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

Documentation problem

The documentation for the 'how' parameter says:

left: use only keys from left frame, similar to a SQL left outer join; preserve key order.
inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

From this description you would expect df1.merge(df2, on='column_name', how='inner') and df1.merge(df2, on='column_name', how='left') to return rows in the same order. However, that's not what happens.

Example below, tested on version 2.0.1:

import pandas as pd

df = pd.DataFrame({
	'n': [1, 2, 3, 1, 2, 3],
	'i': [0, 1, 2, 3, 4, 5]
})

df2 = pd.DataFrame({
	'n': [1, 2, 3],
	'str': ['1', '2', '3']
})

print(df.merge(df2, on='n', how='inner'))
print('--------')
print(df.merge(df2, on='n', how='left'))

Output:

   n  i str
0  1  0   1
1  1  3   1
2  2  1   2
3  2  4   2
4  3  2   3
5  3  5   3
--------
   n  i str
0  1  0   1
1  2  1   2
2  3  2   3
3  1  3   1
4  2  4   2
5  3  5   3

Suggested fix for documentation

Either clarify that the merge operation will sort the results based on the left key when using 'inner' for the 'how' parameter, rather than "preserve the order", or explicitly state that the operation does not guarantee any order.
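
Until the wording is settled, one way to avoid relying on the merge's row order at all is to record the left frame's row positions before merging and sort by them afterwards. The following is a minimal sketch under that assumption; `_left_pos` is a hypothetical helper column name chosen for illustration, not part of the pandas API.

```python
import pandas as pd

df = pd.DataFrame({"n": [1, 2, 3, 1, 2, 3], "i": [0, 1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"n": [1, 2, 3], "str": ["1", "2", "3"]})

# `_left_pos` is a hypothetical helper column that remembers each left row's position.
left = df.assign(_left_pos=range(len(df)))

merged = (
    left.merge(df2, on="n", how="inner")
    .sort_values("_left_pos", kind="stable")  # restore the left frame's row order
    .drop(columns="_left_pos")
    .reset_index(drop=True)
)

print(merged["n"].tolist())  # [1, 2, 3, 1, 2, 3]
```

Because the sort is explicit, the result should not depend on how a given pandas version orders the rows of an inner merge.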

lucdem added the Docs and Needs Triage (issue that has not been reviewed by a pandas team member) labels on May 9, 2023
@AlexKirko
Member

@lucdem Could you please give the issue a title?

lucdem changed the title from "DOC:" to "DOC: Key order after dataframe inner merge" on May 12, 2023
@topper-123
Contributor

I'd say "order the results based on first occurrence of the left key" is more correct than "sort the results based on the left key"; see the example below:

import pandas as pd

df = pd.DataFrame({
  'n': [2, 1, 3, 1, 2, 3],  # changed order
  'i': [0, 1, 2, 3, 4, 5]
})

df2 = pd.DataFrame({
  'n': [1, 2, 3],
  'str': ['1', '2', '3']
})

print(df.merge(df2, on='n', how='inner'))
print('--------')
print(df.merge(df2, on='n', how='left'))

with the result now being:

   n  i str
0  2  0   2
1  2  4   2
2  1  1   1
3  1  3   1
4  3  2   3
5  3  5   3
--------
   n  i str
0  2  0   2
1  1  1   1
2  3  2   3
3  1  3   1
4  2  4   2
5  3  5   3

Note that 2 comes before 1 in the inner example. If you've got an improvement to the current wording, you're welcome to submit a PR.
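
To tell the two candidate descriptions apart on whatever pandas version is installed, the observed key order can be compared with both the plain left order and the "grouped by first occurrence" order. This is only a sketch for checking behavior; the variable names are illustrative, and it makes no claim about which ordering a particular release produces.

```python
import pandas as pd

df = pd.DataFrame({"n": [2, 1, 3, 1, 2, 3], "i": [0, 1, 2, 3, 4, 5]})
df2 = pd.DataFrame({"n": [1, 2, 3], "str": ["1", "2", "3"]})

observed = df.merge(df2, on="n", how="inner")["n"].tolist()

# Candidate 1: rows come back in the left frame's row order.
left_order = df["n"].tolist()

# Candidate 2: rows are grouped by key, with keys in order of first occurrence on the left.
first_seen = {key: pos for pos, key in enumerate(df["n"].drop_duplicates().tolist())}
grouped_order = sorted(left_order, key=first_seen.get)  # sorted() is stable

if observed == left_order:
    label = "left row order"
elif observed == grouped_order:
    label = "grouped by first occurrence of the left key"
else:
    label = "some other order"

print(pd.__version__, "->", label)
```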

topper-123 removed the Needs Triage label on May 14, 2023
@topper-123
Contributor

BTW, I think the current wording is actually correct, but maybe can be clearer. I think I'll close this issue unless you can show an error in the docs, but an improved doc string is still welcome.

@prhbrt

prhbrt commented May 17, 2023

> BTW, I think the current wording is actually correct, but maybe can be clearer. I think I'll close this issue unless you can show an error in the docs, but an improved doc string is still welcome.

But the order of the keys is not preserved at all; the keys are sorted. In this example the order of a should have been preserved, but it is reversed.

import pandas as pd
import numpy as np
from IPython.display import display

a = pd.DataFrame(index=-np.arange(4))
b = pd.DataFrame(index=-np.arange(8)//2)
display(pd.merge(a, b, how='left', right_index=True, left_index=True).index)
display(a.index)

Index([-3, -3, -2, -2, -1, -1, 0], dtype='int64')
Index([0, -1, -2, -3], dtype='int64')

It would be better not to mention key order at all, because what the docs say now is wrong.
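
For the index-on-index case, a possible workaround is to move the index into a column, remember the left row positions, merge on the column, and restore the order explicitly. This is a sketch under stated assumptions: the names `key` and `_left_pos` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(index=-np.arange(4))
b = pd.DataFrame(index=-np.arange(8) // 2)

# Turn both indexes into an ordinary column named "key" (hypothetical name).
left = a.rename_axis("key").reset_index()
left["_left_pos"] = np.arange(len(left))  # hypothetical helper column
right = b.rename_axis("key").reset_index()

merged = (
    left.merge(right, on="key", how="left")
    .sort_values("_left_pos", kind="stable")  # restore the order of a's index
    .drop(columns="_left_pos")
    .set_index("key")
)

print(merged.index.tolist())  # [0, -1, -1, -2, -2, -3, -3]
```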

@topper-123
Contributor

That did not happen in your original post though; this is different. In the original post the keys were ordered by first appearance.

I can see that in the new example the result is sorted, which is not right. So the issue is "Key order after dataframe inner merge on index" (but not on columns), and this is a bug. Do you agree?

topper-123 reopened this on May 17, 2023
@wcgonzal
Contributor

wcgonzal commented May 23, 2023

Can I work on improving the explanation/documentation provided under pandas.merge()?
Or is this going to be changed from DOC to BUG?
