Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG (string): ArrowStringArray.find corner cases #59562

Merged
merged 10 commits into from
Sep 6, 2024

Conversation

jbrockmendel
Copy link
Member

  • closes #xxxx (Replace xxxx with the GitHub issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

@jorisvandenbossche
Copy link
Member

This will only work if we also backport #56792

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Aug 20, 2024
@jorisvandenbossche
Copy link
Member

This will only work if we also backport #56792

Backporting that specific PR is of course certainly possible, but if we want to avoid in general to have to backport older fixes, we could also copy the _str_find implementation of ArrowEA to ArrowStringArray (i.e. actually duplicating it), and only do the actual de-duplication later.

@jorisvandenbossche
Copy link
Member

(moving the most up to date implementation directly to the shared mixin (or to ArrowStringArray, depending on questions in #59555), would actually also avoid requiring to backport the older PR)

@jbrockmendel
Copy link
Member Author

@jorisvandenbossche ive lost track of all your comments across several places. pls advise on the preferred course of action.

@jbrockmendel jbrockmendel force-pushed the bug-str-find branch 2 times, most recently from e4c2157 to 3433cec Compare August 26, 2024 21:24
@jorisvandenbossche
Copy link
Member

I think moving the shared _str_find method to the mixin would be best, if possible: that's consistent with how we are sharing code in other PRs now, and that also ensures that we don't have to care about the older backport, because this PR will simply put the fixed _str_find code in a new place, ensuring we get the older fix with it, also for StringDtype.

and not (start != 0 and end is not None)
and not (start == 0 and end is None)
):
# https://github.com/pandas-dev/pandas/pull/59562/files#r1725688888
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link points to this PR, but it doesn't seem to work to actually link to a specific comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I suppose it points to #59562 (comment)

Now, can't we move that into the mixin as well? (avoiding this override) This is something that was just buggy in pyarrow before that version AFAIU, so I think there is no harm in also doing object-dtype fallback for ArrowDtype, since it otherwise just errors wrongly

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh i couldve sworn i copy/pasted it from somewhere, but now i cant find it. will update

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you check my question in my second comment above?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About moving the implementation to the mixin? This PR now does that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's about the if block that this thread is commenting on, that is not in the mixin and my question is whether it shouldn't be moved as well? (it's a bug that affects ArrowEA as well AFAIU)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for clarifying. will give it a try.

# Convert an int-dtype arrow result to an appropriate output type.
raise NotImplementedError

def _apply_elementwise(self, func: Callable) -> list[list[Any]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be implemented for ArrowStringArray as well then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is only used for the ArrowEA version. The ArrowStringArray goes through _str_map, which ArrowEA doesn't have. eventually id like to align the names, but there are too many branches/PRs as it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what I am missing, but _apply_elementwise is called from the now-shared _str_find method just below, and so I would think that you can also get there from ArrowStringArray._str_find ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep my bad. ArrowStringArray inherits ArrowEA so gets its apply_elementwise from there. putting it here just prevents mypy from complaining

@jbrockmendel
Copy link
Member Author

I think comments have been addressed here

@jbrockmendel jbrockmendel force-pushed the bug-str-find branch 2 times, most recently from a33325c to 6e8e2ce Compare August 29, 2024 22:16
Comment on lines +440 to +426
if (
pa_version_under13p0
and not (start != 0 and end is not None)
and not (start == 0 and end is None)
):
# GH#59562
return super()._str_find(sub, start, end)
return self._convert_int_result(result)
return ArrowStringArrayMixin._str_find(self, sub, start, end)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this special case is moved in the mixin method, I would expect this can be removed entirely? (and replaced with a _str_find = ArrowStringArrayMixin._str_find)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this goes through a cython path instead of iterating in python

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, through _str_map using lib.map_infer_mask, I suppose. But if there is a cython implementation that is presumably faster, shouldn't we use that for the ArrowDtype as well?
I saw that in the center PR at https://github.com/pandas-dev/pandas/pull/59624/files#diff-ca6e5560b2fc1721e129b85f10882df8a1f20b5f1ef4dff547170fa35898dfa6R62 you didn't use _apply_elementwise but also explicitly went through object dtype. That's for the same reason? Can we use the same pattern?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like doing this broke the min_versions build, so reverted

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Aug 31, 2024 via email

@jorisvandenbossche
Copy link
Member

Yes, but I am talking about two open PRs. Can't we already align in that? (I am not speaking about other code in arrow/array.py, but only about things we now refactor and move to _arrow_string_mixins.py

@jbrockmendel jbrockmendel force-pushed the bug-str-find branch 3 times, most recently from 4b878ce to 28aa96b Compare September 4, 2024 21:58
Comment on lines +1985 to +1987
arrow_str_series = s.astype(pd.StringDtype(storage="pyarrow"))
result2 = arrow_str_series.str.find(sub, start, end).astype(result.dtype)
tm.assert_series_equal(result2, expected)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future PRs, we should add such tests to pandas/tests/strings, I think (because now it is testing StringDtype in tests specifically for ArrowDtype ..)

@jorisvandenbossche jorisvandenbossche merged commit 3f8d3e4 into pandas-dev:main Sep 6, 2024
46 of 47 checks passed
@jorisvandenbossche
Copy link
Member

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants