Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Arrow String Array that is compatible with NumPy semantics #54533

Merged
merged 24 commits into from
Aug 23, 2023

Conversation

phofl
Copy link
Member

@phofl phofl commented Aug 13, 2023

cc @jbrockmendel @jorisvandenbossche

I'll do a couple of pre-cursors to reduce the diff

Context: #54792

Comment on lines 457 to 460
def __getattribute__(self, item):
if item in ArrowStringArrayMixin.__dict__:
return partial(getattr(ArrowStringArrayMixin, item), self)
return super().__getattribute__(item)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for doing it this way instead of actually adding the mixin class inheritance?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ArrowStringArray inherits from ArrowExtensionArray as well, that will raise an error because the order can't be figured out. Not suer if there is a workaround that I don't know

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, giving a "diamond" inheritance. Yeah, that doesn't work. I think you could fix that by making a BaseArrowExtensionArray that doesn't yet inherit from the mixin, and then ArrowExtensionArray adds in the mixin. And then here ArrowStringArray could inherit from that base class, or something like that? Not sure that is worth it

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a brief comment in the code with that explanation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

pandas/core/arrays/string_arrow.py Outdated Show resolved Hide resolved
@jorisvandenbossche
Copy link
Member

You will also need to update the na_value use for the dtype:

--- a/pandas/core/arrays/string_.py
+++ b/pandas/core/arrays/string_.py
@@ -101,7 +101,10 @@ class StringDtype(StorageExtensionDtype):
     #: StringDtype().na_value uses pandas.NA
     @property
     def na_value(self) -> libmissing.NAType:
-        return libmissing.NA
+        if self.storage == "pyarrow_numpy":
+            return np.nan
+        else:
+            return libmissing.NA

That eg ensures that getitem returns a NaN instead of pd.NA

…semantics

# Conflicts:
#	pandas/core/arrays/_arrow_string_mixins.py
#	pandas/tests/strings/__init__.py
#	pandas/tests/strings/test_strings.py
@jbrockmendel
Copy link
Member

Flying today, will look tomorrow/thursday

@phofl
Copy link
Member Author

phofl commented Aug 14, 2023

@jorisvandenbossche would you be ok with doing this as a follow up? I tested this on my train ride but it broke a bunch of tests (obviously), which would make review even more complicated

phofl and others added 3 commits August 14, 2023 17:21
…semantics_2

# Conflicts:
#	pandas/tests/strings/test_find_replace.py
@phofl
Copy link
Member Author

phofl commented Aug 16, 2023

See #54585 for the na value

pandas/core/arrays/string_arrow.py Outdated Show resolved Hide resolved
Comment on lines 453 to 456
def _result_converter(values, na=None):
if not isna(na):
values = values.fill_null(bool(na))
return ArrowExtensionArray(values).to_numpy(na_value=np.nan)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only if not isna(na)? I assume we will generally always have to fill nulls? (otherwise the to_numpy with nan gives object dtype)

For example (using this branch):

n [19]: pd.Series(["a", "b", None], dtype="string[pyarrow]").str.startswith("a")
Out[19]: 
0     True
1    False
2     <NA>
dtype: boolean

In [20]: pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]").str.startswith("a")
Out[20]: 
0     True
1    False
2      NaN
dtype: object

If we want that the second example has a proper numpy bool dtype that can be used for boolean indexing, we probably should convert the NaN into False? (either with the fill_null call or with to_numpy(na_value=False))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about that. This would break behaviour compared to object dtype, which is something I tried to avoid (and will break a huge number of tests). I think it makes more sense to keep as is for consistency sake

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, OK, so also the current object-dtype string methods propagate the NaN, wasn't aware of that:

In [5]: pd.Series(["a", "b", None], dtype="object").str.startswith("a")
Out[5]: 
0     True
1    False
2     None
dtype: object

Personally, I think that's something we should change, but that's for another issue then.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah happy to discuss that, but maybe better as a breaking change for 3.0? Don't know though. Let's open an issue about that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah happy to discuss that, but maybe better as a breaking change for 3.0? Don't know though. Let's open an issue about that

Opened #54805 for the question whether string predicate methods like startswith should propagate NaN or not

Comment on lines 457 to 460
def __getattribute__(self, item):
if item in ArrowStringArrayMixin.__dict__:
return partial(getattr(ArrowStringArrayMixin, item), self)
return super().__getattribute__(item)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a brief comment in the code with that explanation?

@@ -500,7 +500,7 @@ def use_inf_as_na_cb(key) -> None:
"string_storage",
"python",
string_storage_doc,
validator=is_one_of_factory(["python", "pyarrow"]),
validator=is_one_of_factory(["python", "pyarrow", "pyarrow_numpy"]),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want the user to allow setting this for the new option as well? Because we also have the new pd.options.future.infer_strings option to enable the future string dtype?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say yes, you might want to astype columns after operations for example

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably something to further discuss in a follow up issue, but I would expect that if you opt-in for the future string dtype with pd.options.future.infer_strings, that this would also automatically set the "pyarrow_numpy" string storage as default for operations that depend on that (like doing astype("string"))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah agree, let's do this as a follow up

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #54793 for this interaction between pd.options.future.infer_strings and the default string_storage

tm.assert_extension_array_equal(result, expected)

if dtype.storage == "pyarrow_numpy":
expected = np.array([False, False, False], dtype=object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be bool instead of object?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that's fair, fixed that.

pandas/tests/arrays/string_/test_string.py Outdated Show resolved Hide resolved
pandas/tests/arrays/test_datetimelike.py Show resolved Hide resolved
pandas/conftest.py Show resolved Hide resolved
Comment on lines +246 to +251
"""
Parametrized fixture for pd.options.mode.string_storage.

* 'python'
* 'pyarrow'
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this is added to override the version in the top-level conftest.py to keep only testing "python" and "pyarrow", and that is because the IO methods don't yet properly support "pyarrow_numpy"?
If so, can you add a comment about that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does work, but tests are tiresome and don't really add value.

The previous tests just checked that string_storage isn't ignored, having 2 or 3 options doesn't make a difference

Comment on lines 281 to 283
if any_string_dtype == "string[pyarrow_numpy]":
pytest.skip("Arrow logic is different")
s = Series(["a", "bb", "cccc", "ddddd", "eeeeee"], dtype=any_string_dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in this case the "string[pyarrow]" version (ArrowStringArray) was still falling back to the python object-dtype implementation? (since we are not skipping that one)

This might be a case where we could add an option to the pyarrow compute kernel that determines if it's aligned left or right for uneven centering?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be the cleanest solution, yes "string[pyarrow]" fell back to python objects

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but I think we should consider for the default string dtype for 3.0 to preserve the current behaviour (and so if necessary for now fallback to the python objects), if our preferred option would be to keep this behaviour long term through a keyword in pyarrow.
(otherwise that would be a change in behaviour again later, when pyarrow would support that)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the above center behaviour, opened #54807 to discuss changing this to fall back for now (to preserve the behaviour)

phofl and others added 7 commits August 21, 2023 10:51
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
…semantics

# Conflicts:
#	pandas/tests/arrays/string_/test_string.py
#	pandas/tests/extension/test_string.py
…semantics

# Conflicts:
#	pandas/tests/strings/__init__.py
@jorisvandenbossche
Copy link
Member

Something that also still needs to be done (for this topic, not necessarily in this PR) is to ensure that the constructors and IO methods that follow pd.options.future.infer_string now use this pyarrow_numpy dtype instead of the plain pyarrow one?

@phofl
Copy link
Member Author

phofl commented Aug 21, 2023

cc @jbrockmendel this should be ready behaviour wise now

@phofl
Copy link
Member Author

phofl commented Aug 22, 2023

@jbrockmendel @jorisvandenbossche are we good to merge here?

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Aug 23, 2023

Yes, let's merge this. You have some follow-up PRs in the line anyway, so any further comment can be addressed in those, I think.

(and I'll open some issues for some of the comments I made)

@jorisvandenbossche jorisvandenbossche merged commit 00f79a3 into pandas-dev:main Aug 23, 2023
33 checks passed
@jorisvandenbossche jorisvandenbossche added this to the 2.1 milestone Aug 23, 2023
@jorisvandenbossche
Copy link
Member

@meeseeksdev backport to 2.1.x

meeseeksmachine pushed a commit to meeseeksmachine/pandas that referenced this pull request Aug 23, 2023
@phofl phofl deleted the string_array_numpy_semantics branch August 23, 2023 16:59
@phofl
Copy link
Member Author

phofl commented Aug 23, 2023

I'll put up a pr for the infer option and the whatsnew later today

phofl added a commit that referenced this pull request Aug 23, 2023
… is compatible with NumPy semantics) (#54713)

Backport PR #54533: Implement Arrow String Array that is compatible with NumPy semantics

Co-authored-by: Patrick Hoefler <61934744+phofl@users.noreply.github.com>
@jorisvandenbossche
Copy link
Member

Opened #54792 as a general follow-up issue tracking the new string dtype topic. And will open some other issues for a few of the remaining discussion items above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants