Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String dtype: map builtin str alias to StringDtype #59685

Merged
merged 26 commits into from
Sep 25, 2024

Conversation

jorisvandenbossche
Copy link
Member

This ensures that the builtin type str is also mapped to the future default string dtype (if infer_string is enabled), i.e. to make .astype(str) behave the same as .astype('str").

In general we rely on numpy for mapping builtin types to dtypes. For example int is mapped to np.dtype("int64") by passing that value to np.dtype(..) in pandas_dtype().
But now that we have a default dtype that is mapped to one of those builtin types, we can no longer rely on numpy doing that. Implementation wise, for pandas_dtype(str) I currently added a special case checking explicitly for str and then importing the dtype lazily and returning it.

An alternative could be to handle this through the ExtensionDtype Registry.find(). However, that currently relies on using ExtensionDtype.construct_from_string() (i.e. we would have to expand the typing there to allow type objects as well next to strings, which is of course also possible). And maybe it is better to have control centrally about builtin aliases, instead of allowing external dtypes to map to those as well.

xref #54792

@jorisvandenbossche jorisvandenbossche added the Strings String extension data type and string data label Sep 2, 2024
@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Sep 2, 2024
if (
hasattr(arr, "dtype")
and arr.dtype.kind in "mM"
and not hasattr(arr, "_pa_array")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit ugly, but essentially this is avoiding that astype(str) for ArrowExtensionArray (if that happens to have a datetimelike dtype) is called here, because that will use ensure_string_array for that implementation, causing a recursion error.

An alternative would be a better check than arr.dtype.kind in "mM" that restricts itself to DatetimeLikeArrayMixin (essentially I need a isinstance(arr, DatetimeLikeArrayMixin), I think). Could potentially do that with some ABC check.

Another alternative is to handle astype(str) specifically in ArrowExtensionArray (currently astype is not implemented there, and it falls back entirely on the base class implementation)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think eventually it would be good to implement ArrowExtensionArray.astype so I would support eventually handing str there. Could you add a TODO comment here describing when this could be removed when astype is implemented?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a TODO comment

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment otherwise looks good

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks great - some minor things

@@ -1448,6 +1450,9 @@ def is_extension_array_dtype(arr_or_dtype) -> bool:
elif isinstance(dtype, np.dtype):
return False
else:
# TODO ugly -> move into registry find()? Or make this work with pandas_dtype?
if dtype is str and using_string_dtype():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other places where maybe we should just have a function like is_builtin_string_dtype instead of this branch? I think its hard to mentally process what logic like this is doing - would be good to abstract that as much as we can

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd I replaced the registry.find() with pandas_dtype(), then I don't need to handle the case of str here specifically because pandas_dtype() already does that.

In general pandas_dtype() should be our general preprocessing function. I only needed to suppress any warning or error here, because in this case if pandas_dtype fails then is_extension_array_dtype should return False.
(potentially that could be factored out into something like pandas_dtype_safe() if that would be needed in multiple places)

@@ -184,7 +184,7 @@ def test_construct_from_string_fill_value_raises(string):
[
(SparseDtype(int, 0), float, SparseDtype(float, 0.0)),
(SparseDtype(int, 1), float, SparseDtype(float, 1.0)),
(SparseDtype(int, 1), str, SparseDtype(object, "1")),
(SparseDtype(int, 1), np.str_, SparseDtype(object, "1")),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we even need this test case? I'm not sure what expectations we have around the np.str_ data type

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, can also leave out this test case entirely. I am not entirely sure yet what dtype=np.str_ should mean, I was planning to open an issue about that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine to remove; I feel like this opens up a pandora's box of issues that aren't worth tracking down at this point in time

pandas/tests/frame/test_constructors.py Show resolved Hide resolved
@mroeschke mroeschke merged commit a9e30c5 into pandas-dev:main Sep 25, 2024
50 of 51 checks passed
@mroeschke
Copy link
Member

Thanks @jorisvandenbossche

return (
self._is_comparable_dtype(dtype)
or is_object_dtype(dtype)
or is_string_dtype(dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is the is_string_dtype case needed here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we allow to compare non-strings with strings for certain cases, I guess?

In [4]: pd.date_range("2020-01-01", periods=3) == "2020-01-01"
Out[4]: array([ True, False, False])

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm looks like two tests fail if this is disabled, test_map_missing_mixed, test_get_indexer_datetime. The map test I'd have to look into, but the get_indexer one make sense:

ii = IntervalIndex.from_breaks(date_range("2018-01-01", periods=4))
target = DatetimeIndex(["2018-01-02"], dtype="M8[ns]")
result = ii.get_indexer(target.astype(str))
> tm.assert_numpy_array_equal(result, expected)

I suspect there are dtype combinations for which we don't need the is_string_dtype check here, might be able to move it to _is_comparable_dtype and restore fast-paths in some cases.

@@ -712,7 +713,7 @@ def _get_indexer(
# left/right get_indexer, compare elementwise, equality -> match
indexer = self._get_indexer_unique_sides(target)

elif not is_object_dtype(target.dtype):
elif not (is_object_dtype(target.dtype) or is_string_dtype(target.dtype)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we exclude string dtype here? id expect _get_indexer to fastpath to all-minus-ones

@jorisvandenbossche jorisvandenbossche deleted the string-dtype-astype-str branch October 10, 2024 08:48
jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024
* String dtype: map builtin str alias to StringDtype

* fix tests

* fix datetimelike astype and more tests

* remove xfails

* try fix typing

* fix copy_view tests

* fix remaining tests with infer_string enabled

* ignore typing issue for now

* move to common.py

* simplify Categorical._str_get_dummies

* small cleanup

* fix ensure_string_array to not modify extension arrays inplace

* fix ensure_string_array once more + fix is_extension_array_dtype for str

* still xfail TestArrowArray::test_astype_str when not using infer_string

* ensure maybe_convert_objects copies object dtype input array when inferring StringDtype

* update test_1d_object_array_does_not_copy test

* update constructor copy test + do not copy in maybe_convert_objects?

* skip str.get_dummies test for now

* use pandas_dtype() instead of registry.find

* fix corner cases for calling pandas_dtype

* add TODO comment in ensure_string_array
jorisvandenbossche added a commit that referenced this pull request Oct 10, 2024
* String dtype: map builtin str alias to StringDtype

* fix tests

* fix datetimelike astype and more tests

* remove xfails

* try fix typing

* fix copy_view tests

* fix remaining tests with infer_string enabled

* ignore typing issue for now

* move to common.py

* simplify Categorical._str_get_dummies

* small cleanup

* fix ensure_string_array to not modify extension arrays inplace

* fix ensure_string_array once more + fix is_extension_array_dtype for str

* still xfail TestArrowArray::test_astype_str when not using infer_string

* ensure maybe_convert_objects copies object dtype input array when inferring StringDtype

* update test_1d_object_array_does_not_copy test

* update constructor copy test + do not copy in maybe_convert_objects?

* skip str.get_dummies test for now

* use pandas_dtype() instead of registry.find

* fix corner cases for calling pandas_dtype

* add TODO comment in ensure_string_array
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backported Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants