Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arrow string array dtype #36142

Closed

Conversation

TomAugspurger
Copy link
Contributor

This is a continuation of #35259, updated to use a single StringDtype parameterized by storage (python or pyarrow). We do use different classes for the array though.

I wouldn't recommend reviewing this yet. In particular, I need to refactor the string ops implementation. That will likely be done in a separate PR to keep things smaller here.

In [1]: import pandas as pd

In [2]: a = pd.Series(pd.array([pd.NA, pd.NA], dtype=pd.StringDtype("python")))

In [3]: b = pd.Series(pd.array([pd.NA, pd.NA], dtype=pd.StringDtype("pyarrow")))

In [4]: a.array
Out[4]:
<StringArray>
[<NA>, <NA>]
Length: 2, dtype: StringDtype[python]

In [5]: b.array
Out[5]:
<ArrowStringArray>
[<NA>, <NA>]
Length: 2, dtype: StringDtype[pyarrow]

Closes #35169

@TomAugspurger
Copy link
Contributor Author

Just FYI @xhochy.

@xhochy
Copy link
Contributor

xhochy commented Sep 5, 2020

@TomAugspurger I would continue to work on this starting next week again. Do we want to coordinate somehow?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I know you said to not yet review, but was just curious ;) The dtype dispatching to two different array classes looks good to me)

@@ -459,6 +461,7 @@ def astype(self, dtype, copy=True):
from pandas.core.arrays.string_ import StringDtype

dtype = pandas_dtype(dtype)
# FIXME: Really hard-code here?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs the general 'astype' machinery refactor to fix this, I think

========================== ==============
``'string'`` global default
``'string[python]'`` python
``'StringDtype[python]'`` python
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why allow the full class name "StringDtype" as string as well? (I don't think we do that for other dtypes?)

EDIT: ah, I see you used that as name above. But so then: why not "string" as name?


@property
def name(self):
return f"StringDtype[{self.storage}]"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return f"StringDtype[{self.storage}]"
return f"string[{self.storage}]"

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, this seems better.


Examples
--------
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="arrow_string")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="arrow_string")
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="string[arrow]")

def _from_sequence(cls, scalars, dtype=None, copy=False):
# TODO(ARROW-9407): Accept pd.NA in Arrow
scalars_corrected = [None if isna(x) else x for x in scalars]
return cls(pa.array(scalars_corrected, type=pa.string()))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return cls(pa.array(scalars_corrected, type=pa.string()))
return cls(pa.array(scalars_corrected, type=pa.string(), from_pandas=True))

then np.nan and pd.NA will be accepted as "null" as well, in addition to None

@jbrockmendel jbrockmendel added the Strings String extension data type and string data label Sep 14, 2020
@TomAugspurger
Copy link
Contributor Author

Closing in favor of #35259 (hopefully)

@simonjayhawkins
Copy link
Member

@TomAugspurger @xhochy OK to merge these changes into #35259?

@TomAugspurger
Copy link
Contributor Author

I'm not sure if we have complete agreement over how this will be exposed to the user yet in #35169. This PR was taking a strong opinion on

  1. One user-facing pd.StringDtype[storage], parametrized by storage = python/pyarrow.
  2. One or two (or three) user-facing StringArray classes, an ArrowStringArray, PythonStringArray, and maybe a base StringArray.

Given the uncertainty around the API here, I'd prefer to just move forward with an internals-only ArrowStringArray and decide on the exact API later (unless we can agree on something quickly).

@simonjayhawkins
Copy link
Member

I'd prefer to just move forward with an internals-only ArrowStringArray

OK, i've just changed #35259 so that the doctest passed. I should probably revert that change and change the doctest so that ArrowStringDtype is not a top level import.

@jorisvandenbossche
Copy link
Member

Personally, I think there is enough agreement to at least start making a concrete PR, which will more easily trigger further discussion on it.
(but let's bring this up again in the issue).

@jorisvandenbossche
Copy link
Member

That said, I think it's certainly fine to focus on the actual implementation in #35259, and leave the user exposing API for a separate PR. There is already more than enough to review and discuss for the internal implementation details ;)

@simonjayhawkins
Copy link
Member

I don't seem to able to push to this branch. will open a new PR

@simonjayhawkins
Copy link
Member

I don't seem to able to push to this branch. will open a new PR

#39908

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Plan for a native string dtype
5 participants