-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: ArrowStringArray.__setitem__ breaks views #45419
Comments
I think we recently had some related discussion about this? (maybe related to the shares_memory testing) But indeed, pyarrow currently doesn't provide APIs to mutate arrays, so currently there is nothing to do about this.
IMO we shouldn't do that. We can easily provide mutability on the Series/DataFrame level for the user, and pyarrow also provides some APIs to efficiently do parts of what |
Yah, i think ArrowStringArray.copy() was involved.
This I'll leave to you. |
So would the options here would be:
|
I dont know enough about how arrow arrays work. are they all immutable or just the string ones? we have precedent for raising TypeError on Could make them immutable now (to avoid silent incorrect behavior) and then if/when CoW is implemented then revisit bc Nothing Is A View. |
Seems like all Arrow objects are immutable: https://arrow.apache.org/docs/cpp/arrays.html#available-strategies
I agree avoiding silent incorrect behavior for the view case would be good, but detecting the view case I'm guessing is impractical. Is there an internal case where we want to rely on |
Agreed we shouldn't be spending any effort trying to track/detect views.
The relevant behavior is so ubiquitous no one would bother to document
I think that ship has sailed since we have trouble defining what Just Work means. Presumably the arrow authors made the arrays immutable for a reason, and the hypothetical user is using a pyarrow dtype for a reason. So I vote that immutable things are immutable. If pyarrow wants to implement an API for |
well we could actually implement this by copying with the new values but o suppose that should be explicit opt in |
Okay I'm coming around to this being immutable since the underlying data is supposed to be immutable. The ArrowStringArray is labeled as experimental so it should be okay to change this. As suggested in https://stackoverflow.com/questions/65845694/how-to-update-data-in-pyarrow-table, we can suggest that users convert to and/or maybe we can change the current |
To be clear: that's what we already do, at the array level (it only doesn't "propagate" to other, viewing arrays)
For some context about this: the Arrow memory format itself doesn't say anything about mutable vs immutable, but it is the explicit choice of Arrow C++ (and thus pyarrow) to make the publicly exposed data structures (Array, RecordBatch) immutable, as this allows many optimizations (the underlying buffers are still mutable buffers, and in principle an advanced user can mutate arrays by mutating the buffers, if you know what you are doing, although we probably shouldn't go that way in pandas). But, it is also not the goal of pyarrow to provide a full end-user dataframe library experience, and it is the goal to enable other projects to do that by building upon pyarrow. And it is this additional dataframe layer that can make decisions about mutability (or what kind of APIs to expose to achieve the equivalent), and that could decide to hide the immutability for the user. Pyarrow also provides APIs to mutate data in the
Personally, I think we can hide this from the user to ensure that standard workflows still work with >>> s = pd.Series(['a', 'b', 'c'], dtype="string[pyarrow]")
>>> s[s == 'a'] = 'b'
>>> s
0 b
1 b
2 c
dtype: string (ignore for a moment that the specific example above could easily be written with In general, updating an object in place with setitem that way is very much standard pandas code, I think. Maybe we can/should start nudging people more towards methods that do not mutate in place, but then we will first need to develop such APIs, because for a basic setitem, there are no good method alternatives AFAIK. AFAIU, the main issue with "emulating" this mutable behaviour in pandas is that this breaks views (the original discussion here). But personally, I would not prefer to give up the basic use case of mutation (like the example above) because it breaks corner cases with views. There are other ways to deal with this aspect (like "document this difference"), and the general CoW proposal will also help here (because that will ensure this "breaking views" issue is no longer a problem at the Series/DataFrame level, only it would still give a difference in behaviour at the Array level. But that's a much more advanced case compared to DataFrame/Series, and at that level it is much easier to either document this difference or either actually let |
I know on some level that pyarrow isn't making design decisions specifically to raise my blood pressure... |
We could provide mutation with an explicit method (something like |
we should just implement this |
That's fine at the DataFrame level, but at the array level it is asking for trouble. pyarrow made a choice to make these immutable in order to allow optimizations. great, let's use those optimizations! @mroeschke is right, we should make an explicit API for this that can be re-used for e.g. SparseArray. |
ok that's fair |
Looks like in
ArrowStringArray.__setitem__
we setself._data = ...
I expect this is a consquence of Arrow arrays (just the string arrays?) being immutable and there's likely not much we can do about it.
This seems likely to cause a long tail of headaches, might be better to raise for immutable arrays (like we do with SparseArray)
The text was updated successfully, but these errors were encountered: