Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REF (string): de-duplicate ArrowStringArray methods #59555

Merged
merged 9 commits into from
Sep 11, 2024

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jbrockmendel jbrockmendel changed the title REF: de-duplicate ArrowStringArray methods REF (string): de-duplicate ArrowStringArray methods Aug 20, 2024
@jorisvandenbossche
Copy link
Member

Could we do it the other way around? i.e. have the method in ArrowExtensionArray point to the one defined in ArrowStringArray?
Personally, I feel it better belongs in string_arrow.py, to keep the string specific functionality in there (the ArrowExtensionArray definition is already huge ..)

@jbrockmendel
Copy link
Member Author

Could we do it the other way around? i.e. have the method in ArrowExtensionArray point to the one defined in ArrowStringArray?

I think the thing to do is put it in the ArrowStringArrayMixin. I'm vaguely planning to move all the relevant stuff up there after the current crop of branches (some still local) are done.

@mroeschke
Copy link
Member

If we're doing this moving of methods away from ArrowExtensionArray, using pandas.ArrowDtype(pyarrow.string()) will continue to use NA semantics correct?

@mroeschke mroeschke added Refactor Internal refactoring of code Strings String extension data type and string data Arrow pyarrow functionality labels Aug 21, 2024
@jbrockmendel
Copy link
Member Author

jbrockmendel commented Aug 21, 2024 via email

@jorisvandenbossche
Copy link
Member

I'm vaguely planning to move all the relevant stuff up there after the current crop of branches (some still local) are done.

If that's the plan, my preference would be do to it at once, instead of spread over multiple PRs. I know we generally want smaller more granular PRs, but given that this will need to be backported, doing this in one go instead of multiple steps will make that easier.

@jorisvandenbossche
Copy link
Member

I think the thing to do is put it in the ArrowStringArrayMixin.

Also a question on the mixin. What's the reason we need the mixin? (compared to keeping things in ArrowStringArray itself)

Right now ArrowStringArray doesn't actually inherit from the mixin, but uses explicit _str_method = ArrowStringArrayMixin._str_method (well, that's actually only done in ArrowStringArrayNumpySemantics, I see, and not in ArrowStringArray, that's another question).
So using that pattern, we could also keep things in ArrowStringArray, and add the methods to ArrowExtensionArray by doing _str_method = ArrowStringArray._str_method in that class.

Or is the idea that ArrowStringArray(NumpySemantics) eventually actually inherits from ArrowStringArrayMixin?

@jbrockmendel
Copy link
Member Author

Also a question on the mixin. What's the reason we need the mixin? (compared to keeping things in ArrowStringArray itself)

It is not really needed. ArrowEA subclasses the mixin and ArrowStringArray subclasses ArrowEA. So all the methods could go directly on ArrowEA, its just that class is already really bulky. Putting the methods directly on ArrowStringArray and reversing the inheritance would surely be possible but that would be more invasive.

@jorisvandenbossche
Copy link
Member

Ah, yes, I forgot about ArrowStringArray subclassing ArrowExtensionArray and therefore also subclassing the mixin.
But the problem with that inheritance is that currently it's not actually used in ArrowStringArray, right? Because ArrowStringArray also subclasses ObjectStringArrayMixin which will get chosen first as fall-back for a method that doesn't exist on ArrowStringArray

That means we should either add the methods explicitly with _str_method = ArrowStringArrayMixin._str_method, or let ArrowStringArray subclass ArrowStringArrayMixin directly with the proper order. However, I think the latter does not work? ("Cannot create a consistent method resolution order (MRO)")

@jbrockmendel
Copy link
Member Author

That means we should either add the methods explicitly with _str_method = ArrowStringArrayMixin._str_method

Yes, one of my local de-dup branches does this for the relevant methods.

_str_strip = ArrowExtensionArray._str_strip
_str_lstrip = ArrowExtensionArray._str_lstrip
_str_rstrip = ArrowExtensionArray._str_rstrip
_str_removesuffix = ArrowStringArrayMixin._str_removesuffix
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_str_removesuffix is also assigned in ArrowStringArrayNumpySemantics below. That can then be removed there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, will update

@jorisvandenbossche
Copy link
Member

If we deduplicate those methods, my preference would be to directly move those methods from ArrowEA to the mixin, and get them from there in ArrowStringArray. Or is there a reason we do not do that here, but use that pattern (shared methods in the mixin instead of in ArrowEA) for other methods?

@jbrockmendel
Copy link
Member Author

my preference would be to directly move those methods from ArrowEA to the mixin, and get them from there in ArrowStringArray.

Sure, will update. Will wait until after #59615, #59568, #59562 and then rebase on top of those to minimize headaches.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Some small comments

else:
result = pc.utf8_rtrim(self._pa_array, characters=to_strip)
return type(self)(result)
return ArrowExtensionArray._str_slice(self, start=start, stop=stop, step=step)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ArrowExtensionArray._str_slice(self, start=start, stop=stop, step=step)
return ArrowStringArrayMixin._str_slice(self, start=start, stop=stop, step=step)

?

I see you didn't actually move _str_slice so the above wouldn't work. But could you move that as well? (most other methods you are touching here are now in the mixin class I think)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also leave out slice because you are fixing that in a separate PR

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will update after the slice PR goes in

removed = pc.utf8_slice_codeunits(self._pa_array, len(prefix))
result = pc.if_else(starts_with, removed, self._pa_array)
return type(self)(result)
return ArrowExtensionArray._str_removeprefix(self, prefix)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return ArrowExtensionArray._str_removeprefix(self, prefix)
return ArrowStringArrayMixin._str_removeprefix(self, prefix)

Just for consistency, because for the direct assignments we point to ArrowStringArrayMixin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will update

_str_slice_replace = ArrowStringArrayMixin._str_slice_replace
_str_len = ArrowStringArrayMixin._str_len

_rank = ArrowExtensionArray._rank
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is strictly speaking not needed? (because it is only otherwise defined on the base EA class, so ArrowEA comes first in the method resolution order)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, will remove

@jorisvandenbossche
Copy link
Member

There are a couple of failures after the latest update

@jorisvandenbossche
Copy link
Member

Those failures are happening on other PRs as well, I think, so merging.

@jorisvandenbossche jorisvandenbossche merged commit 4444e52 into pandas-dev:main Sep 11, 2024
42 of 47 checks passed
@jbrockmendel jbrockmendel deleted the ref-dedup branch September 18, 2024 18:48
jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality backported Refactor Internal refactoring of code Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants