ENH: Implement more efficient merge for arrow strings #54443

Merged (4 commits) on Aug 8, 2023

@phofl phofl commented Aug 6, 2023

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Let's see if this passes CI.

import random
import string
import pandas as pd
import pyarrow as pa


def random_string():
    return "".join(random.choices(string.ascii_letters, k=7))


df = pd.DataFrame({"a": pd.Series([random_string() for _ in range(3_000_000)], dtype=pd.ArrowDtype(pa.string())), "b": 1})
df2 = pd.DataFrame({"a": pd.Series([random_string() for _ in range(1_000_000)], dtype=pd.ArrowDtype(pa.string())), "c": 1})

# Outer merge on the shared Arrow-backed string column "a"
df.merge(df2, how="outer")

Main: 1.4701979160308838
PR: 0.6503970623016357

):
    lk, _ = lk._values_for_factorize()

    # error: Item "ndarray" of "Union[Any, ndarray]" has no attribute
    # "_values_for_factorize"
    rk, _ = rk._values_for_factorize()  # type: ignore[union-attr]
elif isinstance(lk.dtype, ArrowDtype) and is_string_dtype(lk.dtype):
Member


No kidding, my reaction on reading this was to sigh and then say out loud "gross". Let's take this opportunity to implement the appropriate EA (ExtensionArray) method.

Member Author


:) As discussed, let's do that as a follow-up. But I agree that this is needed.

@mroeschke mroeschke added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Arrow (pyarrow functionality) labels Aug 7, 2023
@phofl phofl added this to the 2.1 milestone Aug 7, 2023
@mroeschke
Member

A whatsnew note would be good; otherwise LGTM.

@mroeschke mroeschke merged commit c1e309b into pandas-dev:main Aug 8, 2023
32 checks passed
@mroeschke
Member

Thanks @phofl

@phofl phofl deleted the merge_arrow_strings branch August 9, 2023 05:03
)
if how == "right":
    return rlab, llab, count
return llab, rlab, count
Member


should the sort keyword matter somewhere in here?

Member Author


Follow-up. This didn't cover sort; I wanted to get the other PR in first.
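
To make the open question concrete: codes assigned in order of first appearance generally do not follow the lexicographic order of the keys, which is presumably what a sorted merge would need. One way the codes could be renumbered is sketched below with NumPy. `sort_codes` is a hypothetical helper, and nothing here is confirmed to match what the follow-up does.

```python
import numpy as np


def sort_codes(uniques, llab, rlab):
    """Hypothetical helper: renumber codes so they follow sorted key order."""
    order = np.argsort(uniques)  # indices that put the uniques in sorted order
    rank = np.empty(len(uniques), dtype=np.intp)
    rank[order] = np.arange(len(uniques))  # maps old code -> new sorted code
    return uniques[order], rank[llab], rank[rlab]


# Codes here follow first appearance: pear=0, apple=1, fig=2.
uniques = np.array(["pear", "apple", "fig"])
llab = np.array([0, 1, 0])
rlab = np.array([2, 1])
sorted_uniques, new_llab, new_rlab = sort_codes(uniques, llab, rlab)
print(sorted_uniques, new_llab, new_rlab)
```

After the remap, comparing two codes gives the same answer as comparing the underlying strings, so downstream code can sort on the integers alone.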

mroeschke pushed a commit to mroeschke/pandas that referenced this pull request Aug 18, 2023
* ENH: Implement more efficient merge for arrow strings

* Fix typing

* Update

* ENH: Implement more efficient merge for arrow strings