Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: fix isin() values handling for python storage #59759

Merged
merged 3 commits into from
Sep 12, 2024
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 8 additions & 1 deletion pandas/conftest.py
Original file line number Diff line number Diff line change
Expand Up @@ -1310,7 +1310,13 @@ def string_storage(request):
pytest.param(("pyarrow", pd.NA), marks=td.skip_if_no("pyarrow")),
pytest.param(("pyarrow", np.nan), marks=td.skip_if_no("pyarrow")),
("python", np.nan),
]
],
ids=[
"string=string[python]",
"string=string[pyarrow]",
"string=str[pyarrow]",
"string=str[python]",
],
)
def string_dtype_arguments(request):
"""
Expand Down Expand Up @@ -1341,6 +1347,7 @@ def dtype_backend(request):

# Alias so we can test with cartesian product of string_storage
string_storage2 = string_storage
string_dtype_arguments2 = string_dtype_arguments


@pytest.fixture(params=tm.BYTES_DTYPES)
Expand Down
18 changes: 18 additions & 0 deletions pandas/core/arrays/string_.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@
nanops,
ops,
)
from pandas.core.algorithms import isin
from pandas.core.array_algos import masked_reductions
from pandas.core.arrays.base import ExtensionArray
from pandas.core.arrays.floating import (
Expand All @@ -65,6 +66,7 @@
import pyarrow

from pandas._typing import (
ArrayLike,
AxisInt,
Dtype,
DtypeObj,
Expand Down Expand Up @@ -731,6 +733,22 @@ def _putmask(self, mask: npt.NDArray[np.bool_], value) -> None:
# base class implementation that uses __setitem__
ExtensionArray._putmask(self, mask, value)

def isin(self, values: ArrayLike) -> npt.NDArray[np.bool_]:
if not isinstance(values, BaseStringArray):
if not lib.is_string_array(np.asarray(values), skipna=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth short-circuiting for the case where values is all NA and self._hasna is False?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is it worth short-circuiting this check when the values.dtype is ArrowDtype(pa.string())?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth short-circuiting for the case where values is all NA and self._hasna is False?

I would say that is probably not worth the added complexity for something that is just speeding up an uncommon corner case? (also, checking self._hasna has a cost)

Also, is it worth short-circuiting this check when the values.dtype is ArrowDtype(pa.string())?

I wanted to do that, but then I would need to import ArrowDtype and pyarrow inline (because this file is for the object string dtype, that works without pyarrow), so at that point I was again unsure if we should add a special case here

(I think a typical use case is passing a list, and that will at the moment never be coerced to ArrowDtype before it gets here)

Now, what we should maybe do to essentially speed up this case is to ensure lib.is_string_array shortcuts on pyarrow strings and directly returns True, and ensure that StringArray._from_sequence shortcuts on pyarrow strings to ensure the most efficient conversion (if it doesn't do that already).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although I can probably check for ArrowDtype(pa.string()) without importing it by comparing for equality with "string[pyarrow]"

values = np.array(
[val for val in values if isinstance(val, str) or isna(val)],
dtype=object,
)
if not len(values):
return np.zeros(self.shape, dtype=bool)

values = self._from_sequence(values, dtype=self.dtype)
Comment on lines +746 to +754
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ensures that we convert the input values to a StringArray first, to make use of the construction validation (coercing to one missing value sentinel).
But, this is only done if we actually get strings passed, and if not, we filter out non-strings. This is also what is done in the ArrowStringArray implementation, and this is to ensure we accept something like isin(["string", 10]) with mixed-type values (I do wonder if we should deprecate that in general, but that's for a later, separate discussion).

else:
values = values.astype(self.dtype, copy=False)

return isin(np.asarray(self), np.asarray(values))

def astype(self, dtype, copy: bool = True):
dtype = pandas_dtype(dtype)

Expand Down
24 changes: 19 additions & 5 deletions pandas/tests/arrays/string_/test_string.py
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,12 @@ def dtype(string_dtype_arguments):
return pd.StringDtype(storage=storage, na_value=na_value)


@pytest.fixture
def dtype2(string_dtype_arguments2):
storage, na_value = string_dtype_arguments2
return pd.StringDtype(storage=storage, na_value=na_value)


@pytest.fixture
def cls(dtype):
"""Fixture giving array type from parametrized 'dtype'"""
Expand Down Expand Up @@ -665,11 +671,7 @@ def test_isin(dtype, fixed_now_ts):
tm.assert_series_equal(result, expected)

result = s.isin(["a", pd.NA])
if dtype.storage == "python" and dtype.na_value is np.nan:
# TODO(infer_string) we should make this consistent
expected = pd.Series([True, False, False])
else:
expected = pd.Series([True, False, True])
expected = pd.Series([True, False, True])
tm.assert_series_equal(result, expected)

result = s.isin([])
Expand All @@ -681,6 +683,18 @@ def test_isin(dtype, fixed_now_ts):
tm.assert_series_equal(result, expected)


def test_isin_string_array(dtype, dtype2):
s = pd.Series(["a", "b", None], dtype=dtype)

result = s.isin(pd.array(["a", "c"], dtype=dtype2))
expected = pd.Series([True, False, False])
tm.assert_series_equal(result, expected)

result = s.isin(pd.array(["a", None], dtype=dtype2))
expected = pd.Series([True, False, True])
tm.assert_series_equal(result, expected)


def test_setitem_scalar_with_mask_validation(dtype):
# https://github.com/pandas-dev/pandas/issues/47628
# setting None with a boolean mask (through _putmaks) should still result
Expand Down
Loading