ENH: Add sort_columns parameter to combine_first #60437

Closed
wants to merge 3 commits into from
ENH: Add sort_columns parameter to combine_first
U-S-jun committed Nov 28, 2024
commit 6333c3b906d86b5bf2072012fa910ea05c766c40
27 changes: 26 additions & 1 deletion pandas/core/frame.py
@@ -8712,7 +8712,7 @@ def combine(
frame_result = self._constructor(result, index=new_index, columns=new_columns)
return frame_result.__finalize__(self, method="combine")

- def combine_first(self, other: DataFrame) -> DataFrame:
+ def combine_first(self, other: DataFrame, sort_columns=True) -> DataFrame:
"""
Update null elements with value in the same location in `other`.

@@ -8728,6 +8728,10 @@ def combine_first(self, other: DataFrame) -> DataFrame:
----------
other : DataFrame
Provided DataFrame to use to fill null values.
sort_columns : bool, default True
Whether to sort the columns in the result DataFrame. If False, the
order of the columns in `self` is preserved.


Returns
-------
@@ -8741,13 +8745,25 @@ def combine_first(self, other: DataFrame) -> DataFrame:

Examples
--------
With the default `sort_columns=True`, the columns of the result are sorted:

>>> df1 = pd.DataFrame({"A": [None, 0], "B": [None, 4]})
>>> df2 = pd.DataFrame({"A": [1, 1], "B": [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0


Preserving the column order of `self` with `sort_columns=False`:

>>> df1 = pd.DataFrame({"B": [None, 4], "A": [0, None]})
>>> df2 = pd.DataFrame({"A": [1, 1], "B": [3, 3]})
>>> df1.combine_first(df2, sort_columns=False)
     B    A
0  3.0  0.0
1  4.0  1.0

Null values still persist if the location of that null value
does not exist in `other`

@@ -8773,6 +8789,8 @@ def combiner(x: Series, y: Series):
return y_values

return expressions.where(mask, y_values, x_values)

all_columns = self.columns.union(other.columns)

if len(other) == 0:
combined = self.reindex(
@@ -8790,6 +8808,13 @@ def combiner(x: Series, y: Series):

if dtypes:
combined = combined.astype(dtypes)

combined = combined.reindex(columns=all_columns, fill_value=None)

if not sort_columns:
combined = combined[self.columns]
Review comment (Member):

Doesn't this drop columns if there are columns in other that aren't in self?
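
The concern can be checked without the patch, because the last step of the `sort_columns=False` branch is an ordinary column selection. A minimal sketch, assuming illustrative frames (`df1`, `df2`, and the extra column `C` are not taken from the PR):

```python
import pandas as pd

# Illustrative frames: column "C" exists only in the `other` frame.
df1 = pd.DataFrame({"B": [None, 4.0], "A": [0.0, None]})
df2 = pd.DataFrame({"A": [1.0, 1.0], "C": [7.0, 8.0]})

# The combined frame contains the union of both column sets.
combined = df1.combine_first(df2)

# Same selection the patch applies when sort_columns=False.
selected = combined[df1.columns]

print(list(combined.columns))
print(list(selected.columns))  # "C" is no longer present
```

Selecting with `self.columns` alone keeps only the caller's columns, so any column that exists only in `other` would be lost from the result.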




return combined.__finalize__(self, method="combine_first")

8 changes: 8 additions & 0 deletions pandas/tests/frame/methods/test_combine_first.py
@@ -560,3 +560,11 @@ def test_combine_first_empty_columns():
result = left.combine_first(right)
expected = DataFrame(columns=["a", "b", "c"])
tm.assert_frame_equal(result, expected)

def test_combine_first_column_order():
    df1 = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
    df2 = pd.DataFrame({"A": [5]}, index=[1])

    result = df1.combine_first(df2, sort_columns=False)
    expected = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
    pd.testing.assert_frame_equal(result, expected)
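
The test above only covers the case where every column of `other` already exists in `self`. The sketch below shows one possible way to keep the caller's column order while still retaining columns that appear only in `other`. It is an assumption for illustration: `combine_first_keep_order` is a hypothetical helper, not part of pandas or of this PR, and it relies only on the existing `combine_first`, `Index.append`, and `Index.difference` methods.

```python
import pandas as pd


def combine_first_keep_order(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper (not pandas API): combine_first that keeps left's column order."""
    combined = left.combine_first(right)
    # left's columns first, in their original order, then right-only columns.
    order = left.columns.append(right.columns.difference(left.columns))
    return combined[order]


# Illustrative frames: "C" exists only in the second frame.
df1 = pd.DataFrame({"B": [1, 2], "A": [3, 4]})
df2 = pd.DataFrame({"A": [5], "C": [6]}, index=[1])

result = combine_first_keep_order(df1, df2)
print(list(result.columns))
```

Note that `Index.difference` sorts its result by default, so columns that exist only in `right` are appended in sorted order rather than in the order they have in `right`.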