Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Implement more efficient merge for arrow strings #54443

Merged
merged 4 commits into from
Aug 8, 2023
Merged
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
27 changes: 26 additions & 1 deletion pandas/core/reshape/merge.py
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,7 @@
is_number,
is_numeric_dtype,
is_object_dtype,
is_string_dtype,
needs_i8_conversion,
)
from pandas.core.dtypes.dtypes import (
Expand Down Expand Up @@ -2401,13 +2402,37 @@ def _factorize_keys(
if not isinstance(lk, BaseMaskedArray) and not (
# exclude arrow dtypes that would get cast to object
isinstance(lk.dtype, ArrowDtype)
and is_numeric_dtype(lk.dtype.numpy_dtype)
and (
is_numeric_dtype(lk.dtype.numpy_dtype)
or is_string_dtype(lk.dtype)
and not sort
)
):
lk, _ = lk._values_for_factorize()

# error: Item "ndarray" of "Union[Any, ndarray]" has no attribute
# "_values_for_factorize"
rk, _ = rk._values_for_factorize() # type: ignore[union-attr]
elif isinstance(lk.dtype, ArrowDtype) and is_string_dtype(lk.dtype):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no kidding my reaction on reading this was to sigh and then say out loud "gross". let's take this opportunity to implement the appropriate EA method

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) As discussed, let's do as a follow up. But I agree that this is needed

import pyarrow as pa
import pyarrow.compute as pc

len_lk = len(lk)
lk = lk._pa_array # type: ignore[union-attr, attr-defined]
rk = rk._pa_array # type: ignore[union-attr, attr-defined]
dc = pa.concat_arrays(
[lk.combine_chunks(), rk.combine_chunks()]
).dictionary_encode()
lukemanley marked this conversation as resolved.
Show resolved Hide resolved
length = len(dc.dictionary)

llab, rlab, count = (
pc.fill_null(dc.indices[slice(len_lk)], length).to_numpy(),
pc.fill_null(dc.indices[slice(len_lk, None)], length).to_numpy(),
len(dc.dictionary),
)
if how == "right":
return rlab, llab, count
return llab, rlab, count
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the sort keyword matter somewhere in here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up. This didn't cover sort. Wanted to get the other PR in first


if needs_i8_conversion(lk.dtype) and lk.dtype == rk.dtype:
# GH#23917 TODO: Needs tests for non-matching dtypes
Expand Down
Loading