Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: fix isin() values handling for python storage #59759

Merged
merged 3 commits into from
Sep 12, 2024

Conversation

jorisvandenbossche
Copy link
Member

This PR adds a StringArray.isin() implementation (ArrowStringArray already has a custom isin()), so we can add custom processing of the passed values set compared to the base class implementation.

See also the discussion at #58451 (comment) about the expected behaviour.

xref #54792

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Sep 9, 2024
Comment on lines +738 to +746
if not lib.is_string_array(np.asarray(values), skipna=True):
values = np.array(
[val for val in values if isinstance(val, str) or isna(val)],
dtype=object,
)
if not len(values):
return np.zeros(self.shape, dtype=bool)

values = self._from_sequence(values, dtype=self.dtype)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that we convert the input values to a StringArray first, to make use of the construction validation (coercing to one missing value sentinel).
But, this is only done if we actually get strings passed, and if not, we filter out non-strings. This is also what is done in the ArrowStringArray implementation, and this is to ensure we accept something like isin(["string", 10]) with mixed-type values (I do wonder if we should deprecate that in general, but that's for a later, separate discussion).

@@ -731,6 +733,22 @@ def _putmask(self, mask: npt.NDArray[np.bool_], value) -> None:
# base class implementation that uses __setitem__
ExtensionArray._putmask(self, mask, value)

def isin(self, values: ArrayLike) -> npt.NDArray[np.bool_]:
if not isinstance(values, BaseStringArray):
if not lib.is_string_array(np.asarray(values), skipna=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth short-circuiting for the case where values is all NA and self._hasna is False?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is it worth short-circuiting this check when the values.dtype is ArrowDtype(pa.string())?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth short-circuiting for the case where values is all NA and self._hasna is False?

I would say that is probably not worth the added complexity for something that is just speeding up an uncommon corner case? (also, checking self._hasna has a cost)

Also, is it worth short-circuiting this check when the values.dtype is ArrowDtype(pa.string())?

I wanted to do that, but then I would need to import ArrowDtype and pyarrow inline (because this file is for the object string dtype, that works without pyarrow), so at that point I was again unsure if we should add a special case here

(I think a typical use case is passing a list, and that will at the moment never be coerced to ArrowDtype before it gets here)

Now, what we should maybe do to essentially speed up this case is to ensure lib.is_string_array shortcuts on pyarrow strings and directly returns True, and ensure that StringArray._from_sequence shortcuts on pyarrow strings to ensure the most efficient conversion (if it doesn't do that already).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I can probably check for ArrowDtype(pa.string()) without importing it by comparing for equality with "string[pyarrow]"

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Sep 11, 2024
@mroeschke mroeschke merged commit 0d2505d into pandas-dev:main Sep 12, 2024
42 of 47 checks passed
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-isin branch September 13, 2024 20:54
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
…ev#59759)

* String dtype: fix isin() values handling for python storage

* address feedback
jorisvandenbossche added a commit that referenced this pull request Oct 10, 2024
* String dtype: fix isin() values handling for python storage

* address feedback
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants