Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: join/merge of DataFrame does not keep order of index #40608

Closed
2 tasks done
sdementen opened this issue Mar 24, 2021 · 2 comments · Fixed by #54611
Closed
2 tasks done

BUG: join/merge of DataFrame does not keep order of index #40608

sdementen opened this issue Mar 24, 2021 · 2 comments · Fixed by #54611
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@sdementen
Copy link
Contributor

  • I have checked that this issue has not already been reported.

The issue may be related to issue #34133

  • I have confirmed this bug exists on the latest version of pandas.

Code Sample, a copy-pastable example

import pandas

idx_single = pandas.RangeIndex(10)
idx_double = pandas.Index(idx_single.tolist() + idx_single.tolist())

# idx_double
df1 = pandas.DataFrame(index=idx_double)
df2 = pandas.DataFrame(index=idx_single, data=1, columns=["one", "two"])
print(df1.index)
print(df1.join(df2, how="left", sort=False).index)
print(df1.merge(df2, how="left", right_index=True, left_index=True, sort=False).index)

Problem description

The index of the joined (or merged) dataframe is not the same as the left hand dataframe even though sort=False.

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Int64Index([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype='int64')
Int64Index([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype='int64')

Expected Output

I would expect the index of the joined/merged dataframe to be the same as the left hand dataframe.

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')

Output of pd.show_versions()

INSTALLED VERSIONS

commit : f2c8480
python : 3.9.0.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.18362
machine : AMD64
processor : Intel64 Family 6 Model 142 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : French_Belgium.1252

pandas : 1.2.3
numpy : 1.20.1
pytz : 2021.1
dateutil : 2.8.1
pip : 21.0.1
setuptools : 49.2.0

@sdementen sdementen added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Mar 24, 2021
@rhshadrach rhshadrach added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 25, 2021
@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 19, 2021
@EuanRichard
Copy link

EuanRichard commented Aug 25, 2021

I believe this bug occurs when there are duplicate entries in the index of the caller. Can test this as follows (Pandas 1.3.2):

>>> df1 = pd.DataFrame(index=[4,3,2,2], columns=['col1'], data=['a','b','c','d'])
>>> df2 = pd.DataFrame(index=[1,2,3,4], columns=['col2'], data=['i','j','k','l'])
>>> print(df1.join(df2)) # By default, how='left' and sort=False

  col1 col2
2    c    j
2    d    j
3    b    k
4    a    l

Resulting index out of order above, compared to df1 index. But, if we change df1 to have unique values in index:

>>> df1.index = [4,3,1,2]
>>> print(df1.join(df2))

  col1 col2
4    a    l
3    b    k
1    c    i
2    d    j

We retain the original index as expected.

@SergeyHein
Copy link

SergeyHein commented Jan 21, 2023

Similar unexpected behavior is observed when using how = "outer"
Code

import pandas as pd
print(f"{pd.__version__=}")
k1 = [5,1,10,2,30]
k2 = [30,5,6,1]
df1=pd.DataFrame({"a":k1, "x1":[100+v for v in k1]}).set_index("a")
df2=pd.DataFrame({"a":k2, "x2":[200+ v for v in k2]}).set_index("a")
v1_unsorted = pd.merge(df1,df2, left_index=True, right_index=True, how = "outer",sort=False)
v2_unsorted = pd.merge(df1,df2, on ="a",how = "outer",sort=False)

v1_sorted = pd.merge(df1,df2, left_index=True, right_index=True, how = "outer",sort=True)
v2_sorted = pd.merge(df1,df2, on ="a",how = "outer",sort=True)

print(f"{v1_unsorted.index=}")
print(f"{v2_unsorted.index=}")

print(f"{v1_sorted.index=}")
print(f"{v2_sorted.index=}")

Output

pd.__version__='1.5.3'
v1_unsorted.index=Int64Index([1, 2, 5, 6, 10, 30], dtype='int64', name='a')
v2_unsorted.index=Int64Index([5, 1, 10, 2, 30, 6], dtype='int64', name='a')
v1_sorted.index=Int64Index([1, 2, 5, 6, 10, 30], dtype='int64', name='a')
v2_sorted.index=Int64Index([1, 2, 5, 6, 10, 30], dtype='int64', name='a')

Expectation is that v2_unsorted.index index is generated by merge when sort = False ( what is by defaut)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants