Plan for a native string dtype #35169
Why make this complicated? I would just make arrow a required import for StringArray and call it a day; it's already experimental. Then bump the version of arrow required as needed. We did exactly this with parquet.
Points that come to mind having worked a bit on that in
Also noting here: I have spent quite some time (sadly not sufficient time) prototyping things around this in
Also note that I track the algorithm coverage of the current
That's the simplest way from our end. Are we willing to require arrow to opt into the new string dtype? Thanks for that list @xhochy, that's extremely valuable. In particular
Regarding immutability, I posted an explanation in #8640 (comment), in case a reader of this thread is unclear why we aren't just making a mutable type here instead.
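To make the immutability point concrete, here is a minimal plain-Python sketch (not pandas/Arrow code) of the variable-width layout Arrow uses for strings: one contiguous data buffer plus an offsets array. Replacing an element with a longer string cannot reuse its slot, so in-place mutation generally means rebuilding buffers.

```python
# Sketch of Arrow's variable-width string layout (illustrative only).
data = b"abbccc"            # one contiguous UTF-8 buffer
offsets = [0, 1, 3, 6]      # element i spans data[offsets[i]:offsets[i + 1]]

strings = [data[offsets[i]:offsets[i + 1]].decode("utf-8") for i in range(3)]
print(strings)  # ['a', 'bb', 'ccc']

# Overwriting "bb" with "dddd" needs 4 bytes where only 2 are reserved,
# so the data buffer and every later offset would have to be rewritten.
```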
@TomAugspurger thanks for getting this discussion started!
I would personally not prefer to tie the use of "string" dtype to the presence of pyarrow. If pyarrow is optional (and I personally prefer to keep it that way for now), I would prefer keeping it optional for the new dtypes as well.
As long as pyarrow is optional, I think we need to keep the "old" implementation around (related to the above). But I agree we probably don't want a
As in general with our new nullable dtypes, operations should as much as possible also return nullable dtypes (so eg the nullable boolean dtype). For List that's of course not yet possible.
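As a concrete illustration of operations returning nullable dtypes (this is the behavior of released pandas, shown here for context): a comparison on a "string" Series yields the nullable boolean dtype, and missing values propagate as `pd.NA` rather than becoming `False`.

```python
import pandas as pd

s = pd.Series(["a", None], dtype="string")
mask = s == "a"   # returns the nullable "boolean" dtype, not numpy bool
# mask[0] is True; mask[1] stays missing (pd.NA) instead of collapsing to False
```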
@xhochy based on that experience, what's your current take on the Array vs ChunkedArray point you raise above?
"I have no idea anymore". From an Arrow standpoint,
We discussed this a bit yesterday on the call. Will try to summarize some of the points that were brought up here (based on our notes and my recollection) with some additional background:
As a user, this is what has always worried me most since we started talking about Arrow as a backend for strings. But I think asking users to choose between mutability and efficiency would be acceptable, as opposed to (I think) making efficient, non-mutable strings the default and (definitely) entirely replacing the current mutable type with one that isn't.
As this will probably need more than the string algorithms that are exposed by the
I plan to start a PR with the basic scaffolding for the data type next week.
Quite early, but if someone wants to follow progress, I opened a draft PR: #35259
Thanks for all the discussion here.
Agreed with this. I think that the downsides of the Arrow implementation (fundamental: mutability; temporary: not all algorithms implemented) mean that we'll want to keep the current
Fair point. Agreed that we can avoid that, at least for now. @jorisvandenbossche suggested a parameter like
@xhochy do you have thoughts on using the dtype to specify how the values should be stored? It's a bit unusual for that to be an attribute of the dtype rather than the array, but I think that fits a bit better with the rest of pandas (
Alternatively we give up on the goal of having a single user-facing
Did you mean to say "just have StringDtype and ArrowStringDtype"? Because we could still have two array classes (but eg subclassing the same base StringArray class), which are coupled to a single, parametrized StringDtype.
I'm not sure exactly what I had in my head, but having separate array classes seems fine.
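For reference, the parametrized-dtype idea discussed here is roughly what later shipped in pandas 1.3: a single user-facing `StringDtype` that takes a `storage` argument. A small sketch (using the `"python"` storage so it runs without pyarrow; `"pyarrow"` is the other accepted value):

```python
import pandas as pd

# One dtype class, parametrized by storage backend ("python" or "pyarrow"),
# coupled to separate array classes under the hood.
dtype = pd.StringDtype(storage="python")
arr = pd.array(["a", "b", None], dtype=dtype)
# arr.dtype.storage tells you which backend is in use
```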
An interesting approach to allow both performance and mutability here would be to pad the UTF-8 strings to their maximum possible length. Even considering the memory consumption hit from this, I think it would still be a big improvement over the pure-Python implementation in terms of memory consumption and performance. This way, if a character grows from 1 byte to 4 bytes, say, it's not a problem, as we would already have allocated enough room for the change via the padding (and it would avoid having to reallocate the whole array for any small change, since there's always enough space to accommodate the new value). I imagine this could require work on the Arrow side of things; do you think this is worth pursuing?
This is similar to (but more extreme than) the NumPy approach and won't work efficiently, for two reasons:
I think my previous thought was ill-defined, so I'll clarify it a bit. If someone knew that their strings would never be more than, say, 255 characters (for instance because the data came from a database), they could use that information to allocate only the space they needed, with minimal padding and minimal cache misses, while retaining mutability. If in the future the user could specify a specific subset of UTF-8, eg "ASCII only", it should be possible to further reduce the padding and memory requirements.
edit: Obviously nothing is perfect, but this does seem like a useful compromise in the many cases where mutability without reallocating the array would be helpful. I would argue that some padding overhead is still cheaper than repeated reallocation of a large array.
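The variable byte width of UTF-8 is what makes the padding math tricky: a fixed budget of N *characters* must reserve the worst case of 4 bytes each, unless the user can promise a narrower subset like ASCII. A pure-Python illustration:

```python
# Modern UTF-8 encodes a code point in 1 to 4 bytes.
widths = {s: len(s.encode("utf-8")) for s in ["a", "é", "€", "𐍈"]}
print(widths)  # {'a': 1, 'é': 2, '€': 3, '𐍈': 4}

# A 255-character slot needs 255 bytes if the data is promised to be ASCII,
# but 255 * 4 = 1020 bytes per slot for arbitrary UTF-8.
ascii_slot = 255 * 1
utf8_slot = 255 * 4
```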
The PR of @TomAugspurger at #36142 proposed a way to expose this "native/arrow-backed string dtype" to the user. Specifically, that PR was doing:
So for now, "string" dtype would still default to
Are people generally OK with this as a way to provide this as an opt-in in the short term?
Sounds good, as this still gives the end-user an easy and obvious way to switch between both implementations.
Sounds good to me too. I expect that the class hierarchy will be something like

```python
class StringArray(ExtensionArray):
    ...

class ArrowStringArray(StringArray):
    ...

class PythonStringArray(StringArray, PandasArray):
    ...
```

and trying to make a
We had a quick chat with @simonjayhawkins and @TomAugspurger; the rough next steps we see, which can be worked on separately, are:
Other follow-ups needed after #35259
One aspect that @simonjayhawkins raised while finalizing the implementation in #39908 (#39908 (comment)) is the use of a parametrized dtype (
Bringing that up here, because it's a more fundamental issue (and not just a technical implementation detail of the PR) that relates to some of the things discussed above. For example, @xhochy argued above for not (only) having a global option to control which storage backend would be used, so you can decide on a per-column basis whether to use Arrow. Personally, because using Arrow or not can still have some important user-facing consequences (eg regarding mutability / overhead when mutating the Arrow-based storage), I think it is useful to make this option easily available to the user when needed (so eg with the
Agreed with this.
There are a few follow-up items to discuss, but I think this issue can be closed with pandas 1.3.0 released. Thanks all!
Apache Arrow has support for natively storing UTF-8 data, and work is ongoing to add kernels (e.g. str.isupper()) that operate on that data. This issue is to discuss how we can expose the native string dtype to pandas' users.
There are several things to discuss:
How do users opt into Arrow-backed StringArray?
The primary difficulty is the additional Arrow dependency. I'm assuming that we
are not ready to adopt it as a required dependency, so all of this will be
opt-in for now (though this point is open for discussion).
StringArray is marked as experimental, so our usual API-breaking restrictions don't apply. But we want to do this in a way that's not too disruptive.
There are three ways to get a `StringDtype`-dtype array today:

- `pd.array(['a', 'b', None])`
- `dtype=pd.StringDtype()`
- `dtype="string"`
My preference is for all of these to stay consistent. They all either give a
StringArray backed by an object-dtype ndarray or a StringArray backed by Arrow
memory.
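The consistency goal can be checked directly; in released pandas all three spellings already produce the same dtype (an illustration of the current behavior, not part of the proposal):

```python
import pandas as pd

a = pd.array(["a", "b", None])                        # dtype inferred
b = pd.array(["a", "b", None], dtype=pd.StringDtype())
c = pd.array(["a", "b", None], dtype="string")
# all three have the same StringDtype, whichever backend it maps to
```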
I also have a preference for not keeping our old implementation around for too long. So I don't think we want something like `pd.PythonStringDtype()` as a way to get the StringArray backed by an object-dtype ndarray.
The easiest way to support this is, I think, an option.
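A sketch of what such an option could look like (at the time of this issue the exact spelling was undecided; the option that eventually shipped in pandas 1.3 is `mode.string_storage`, shown here with the `"python"` value so the example runs without pyarrow):

```python
import pandas as pd

# With the "pyarrow" value (requires pyarrow installed), dtype="string"
# would produce an Arrow-backed array instead.
pd.set_option("mode.string_storage", "python")
arr = pd.array(["a", "b", None], dtype="string")
# arr.dtype.storage reflects the active backend
```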
Then all of those would create an Arrow-backed StringArray.
Fallback Mode
It's likely that Arrow 1.0 will not implement all the string kernels we need.
So when someone does
we have a few options:
I'm not sure which is best. My preference for now is probably to raise, but I could see doing either.
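A hedged sketch of what the "fall back" option could look like (the names here are hypothetical, not pandas API): try the Arrow kernel first and route to the existing object-dtype path when the kernel is missing.

```python
def call_with_fallback(arrow_kernel, python_func, values):
    """Hypothetical dispatch: prefer the Arrow kernel, else the Python path."""
    if arrow_kernel is None:  # kernel not implemented in this Arrow version
        return python_func(values)
    return arrow_kernel(values)

# No Arrow kernel available -> the object-dtype fallback is used:
out = call_with_fallback(None, lambda vs: [v.isupper() for v in vs], ["AB", "cd"])
print(out)  # [True, False]
```

The alternative discussed above is to raise instead of silently falling back, which keeps performance characteristics predictable at the cost of convenience.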