-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arrow string array dtype #36142
Arrow string array dtype #36142
Conversation
Just FYI @xhochy. |
@TomAugspurger I would continue to work on this starting next week again. Do we want to coordinate somehow? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I know you said to not yet review, but was just curious ;) The dtype dispatching to two different array classes looks good to me)
@@ -459,6 +461,7 @@ def astype(self, dtype, copy=True): | |||
from pandas.core.arrays.string_ import StringDtype | |||
|
|||
dtype = pandas_dtype(dtype) | |||
# FIXME: Really hard-code here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs the general 'astype' machinery refactor to fix this, I think
========================== ============== | ||
``'string'`` global default | ||
``'string[python]'`` python | ||
``'StringDtype[python]'`` python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why allow the full class name "StringDtype" as string as well? (I don't think we do that for other dtypes?)
EDIT: ah, I see you used that as name above. But so then: why not "string" as name?
|
||
@property | ||
def name(self): | ||
return f"StringDtype[{self.storage}]" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return f"StringDtype[{self.storage}]" | |
return f"string[{self.storage}]" |
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, this seems better.
|
||
Examples | ||
-------- | ||
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="arrow_string") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="arrow_string") | |
>>> pd.array(['This is', 'some text', None, 'data.'], dtype="string[arrow]") |
def _from_sequence(cls, scalars, dtype=None, copy=False): | ||
# TODO(ARROW-9407): Accept pd.NA in Arrow | ||
scalars_corrected = [None if isna(x) else x for x in scalars] | ||
return cls(pa.array(scalars_corrected, type=pa.string())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return cls(pa.array(scalars_corrected, type=pa.string())) | |
return cls(pa.array(scalars_corrected, type=pa.string(), from_pandas=True)) |
then np.nan and pd.NA will be accepted as "null" as well, in addition to None
Closing in favor of #35259 (hopefully) |
@TomAugspurger @xhochy OK to merge these changes into #35259? |
I'm not sure if we have complete agreement over how this will be exposed to the user yet in #35169. This PR was taking a strong opinion on
Given the uncertainty around the API here, I'd prefer to just move forward with an internals-only |
OK, i've just changed #35259 so that the doctest passed. I should probably revert that change and change the doctest so that ArrowStringDtype is not a top level import. |
Personally, I think there is enough agreement to at least start making a concrete PR, which will more easily trigger further discussion on it. |
That said, I think it's certainly fine to focus on the actual implementation in #35259, and leave the user exposing API for a separate PR. There is already more than enough to review and discuss for the internal implementation details ;) |
I don't seem to able to push to this branch. will open a new PR |
|
This is a continuation of #35259, updated to use a single
StringDtype
parameterized bystorage
(python or pyarrow). We do use different classes for the array though.I wouldn't recommend reviewing this yet. In particular, I need to refactor the string ops implementation. That will likely be done in a separate PR to keep things smaller here.
Closes #35169