ENH: Add sort_columns parameter to combine_first #60437

Open: wants to merge 3 commits into base: main


@U-S-jun (Contributor) commented Nov 28, 2024

  • Fixes BUG: combine_first reorders columns #60427
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This PR enhances the combine_first method in pandas.DataFrame by adding a new parameter, sort_columns, which allows users to control whether the result's column order should be sorted lexicographically or preserve the original order of the calling DataFrame (self).

Currently, combine_first always returns a DataFrame whose columns are sorted lexicographically, which is undesirable for users who want to keep the column order of the original DataFrame.


import pandas as pd

df1 = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
df2 = pd.DataFrame({"A": [5]}, index=[1])

result = df1.combine_first(df2)
# Current behavior:
#    A  B
# 0  3  1
# 1  4  2

With the new sort_columns parameter:

Default Behavior (sort_columns=True): Columns remain sorted as before.
New Behavior (sort_columns=False): Columns retain the order from the original DataFrame (self).


df1 = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
df2 = pd.DataFrame({"A": [5]}, index=[1])

result = df1.combine_first(df2, sort_columns=False)
# New behavior:
#    B  A
# 0  1  3
# 1  2  4

Tests: Added new test cases in pandas/tests/frame/methods/test_combine_first.py to validate:
Default behavior with sort_columns=True.
Column order preservation with sort_columns=False.

Documentation:
Updated the docstring for combine_first with examples showcasing the new parameter.
Added a changelog entry in doc/source/whatsnew/v3.0.0.rst.

This enhancement maintains backward compatibility, as the default behavior (sort_columns=True) remains unchanged. The new parameter provides additional flexibility for users who need control over column order.
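
For comparison, the same column order can be obtained with released pandas by reordering the result after the call. The following is only a minimal sketch: the helper name combine_first_keep_order is hypothetical and not part of pandas or of this PR, and it assumes column labels are unique.

```python
import pandas as pd


def combine_first_keep_order(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not a pandas API): combine_first with left's column order."""
    combined = left.combine_first(right)
    # Columns of `left` come first, in their original order, followed by
    # columns that exist only in `right`, in the order they appear there.
    ordered = list(left.columns) + [c for c in right.columns if c not in left.columns]
    return combined[ordered]


df1 = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
df2 = pd.DataFrame({"A": [5], "C": [7]}, index=[1])

result = combine_first_keep_order(df1, df2)
print(list(result.columns))
```

Appending the right-only columns rather than discarding them keeps every column that combine_first itself returns.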

@rhshadrach (Member) left a comment

While I'm positive on the linked issue, I'm negative on adding a keyword that controls whether the columns are sorted. I think changing pandas to preserve column order is a bugfix.

combined = combined.reindex(columns=all_columns, fill_value=None)

if not sort_columns:
    combined = combined[self.columns]
Member

Doesn't this drop columns if there are columns in other that aren't in self?
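
A small sketch of the concern, using released pandas only (no sort_columns argument). The frames and the column "C" are made up for illustration; selecting with the caller's columns stands in for the combined[self.columns] line quoted above.

```python
import pandas as pd

df1 = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
# "C" exists only in the other frame.
df2 = pd.DataFrame({"A": [5], "C": [7]}, index=[1])

combined = df1.combine_first(df2)

# Equivalent of `combined[self.columns]`: only the caller's columns survive.
kept = combined[df1.columns]

print(list(combined.columns))
print(list(kept.columns))
```

Here combined contains "C" but kept does not, so indexing by self.columns alone would silently discard columns contributed by other.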

@rhshadrach added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) label on Dec 1, 2024