Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: ArrowStringArray.__setitem__ breaks views #45419

Open
jbrockmendel opened this issue Jan 17, 2022 · 14 comments
Open

BUG: ArrowStringArray.__setitem__ breaks views #45419

jbrockmendel opened this issue Jan 17, 2022 · 14 comments
Labels
Arrow pyarrow functionality Bug ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Jan 17, 2022

arr = pd.array(['foo', 'bar'], dtype='string[pyarrow]')
arr2 = arr[:]

arr2[0] = 'baz'

>>> arr
<ArrowStringArray>
['foo', 'bar']
Length: 2, dtype: string

Looks like in ArrowStringArray.__setitem__ we set self._data = ...

I expect this is a consquence of Arrow arrays (just the string arrays?) being immutable and there's likely not much we can do about it.

This seems likely to cause a long tail of headaches, might be better to raise for immutable arrays (like we do with SparseArray)

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 17, 2022
@jorisvandenbossche
Copy link
Member

I think we recently had some related discussion about this? (maybe related to the shares_memory testing) But indeed, pyarrow currently doesn't provide APIs to mutate arrays, so currently there is nothing to do about this.

This seems likely to cause a long tail of headaches, might be better to raise for immutable arrays

IMO we shouldn't do that. We can easily provide mutability on the Series/DataFrame level for the user, and pyarrow also provides some APIs to efficiently do parts of what setitem does (eg for changing values with a mask, there is pyarrow.compute.replace_with_mask)

@jbrockmendel
Copy link
Member Author

I think we recently had some related discussion about this? (maybe related to the shares_memory testing)

Yah, i think ArrowStringArray.copy() was involved.

We can easily provide mutability on the Series/DataFrame level for the user, and pyarrow also provides some APIs to efficiently do parts of what setitem does (eg for changing values with a mask, there is pyarrow.compute.replace_with_mask)

This I'll leave to you.

@mroeschke mroeschke added Arrow pyarrow functionality ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Feb 6, 2022
@mroeschke
Copy link
Member

mroeschke commented Mar 29, 2022

So would the options here would be:

  1. Document that ArrowExtentionArray doesn't support __setitem__ with views
  2. Raise a NotImplementedError if the user tries to do this?

@jbrockmendel
Copy link
Member Author

I dont know enough about how arrow arrays work. are they all immutable or just the string ones?

we have precedent for raising TypeError on SparseArray.__setitem__ (though there are some cases there im hoping to make work...)

Could make them immutable now (to avoid silent incorrect behavior) and then if/when CoW is implemented then revisit bc Nothing Is A View.

@mroeschke
Copy link
Member

Seems like all Arrow objects are immutable: https://arrow.apache.org/docs/cpp/arrays.html#available-strategies

As Arrow objects are immutable...

I agree avoiding silent incorrect behavior for the view case would be good, but detecting the view case I'm guessing is impractical.

Is there an internal case where we want to rely on __setitem__ with views working as intended? My gut feeling is that users might want this to Just Work i.e. document that this always return new data and views are not supported

@jbrockmendel
Copy link
Member Author

I agree avoiding silent incorrect behavior for the view case would be good, but detecting the view case I'm guessing is impractical.

Agreed we shouldn't be spending any effort trying to track/detect views.

Is there an internal case where we want to rely on __setitem__ with views working as intended?

The relevant behavior is so ubiquitous no one would bother to document # _this_ slice needs to be a view. i.e. that's an empirical question that we can't feasibly answer.

My gut feeling is that users might want this to Just Work

I think that ship has sailed since we have trouble defining what Just Work means. Presumably the arrow authors made the arrays immutable for a reason, and the hypothetical user is using a pyarrow dtype for a reason.

So I vote that immutable things are immutable. If pyarrow wants to implement an API for __setitem__ that'll be great.

@jreback
Copy link
Contributor

jreback commented Mar 30, 2022

well we could actually implement this by copying with the new values

but o suppose that should be explicit opt in

@mroeschke
Copy link
Member

Okay I'm coming around to this being immutable since the underlying data is supposed to be immutable. The ArrowStringArray is labeled as experimental so it should be okay to change this.

As suggested in https://stackoverflow.com/questions/65845694/how-to-update-data-in-pyarrow-table, we can suggest that users convert to string[python] first if they want to modify the data.

and/or maybe we can change the current __setitem__ to be a special setting method for Arrow backed arrays def set_and_copy

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 6, 2022

well we could actually implement this by copying with the new values

To be clear: that's what we already do, at the array level (it only doesn't "propagate" to other, viewing arrays)


So I vote that immutable things are immutable. If pyarrow wants to implement an API for __setitem__ that'll be great.

For some context about this: the Arrow memory format itself doesn't say anything about mutable vs immutable, but it is the explicit choice of Arrow C++ (and thus pyarrow) to make the publicly exposed data structures (Array, RecordBatch) immutable, as this allows many optimizations (the underlying buffers are still mutable buffers, and in principle an advanced user can mutate arrays by mutating the buffers, if you know what you are doing, although we probably shouldn't go that way in pandas).

But, it is also not the goal of pyarrow to provide a full end-user dataframe library experience, and it is the goal to enable other projects to do that by building upon pyarrow. And it is this additional dataframe layer that can make decisions about mutability (or what kind of APIs to expose to achieve the equivalent), and that could decide to hide the immutability for the user.

Pyarrow also provides APIs to mutate data in the pyarrow.compute module (only not in-place), that are exactly there to help to implement mutating APIs (like the replace_with_mask that we already use under the hood).


Okay I'm coming around to this being immutable since the underlying data is supposed to be immutable. The ArrowStringArray is labeled as experimental so it should be okay to change this.
..
and/or maybe we can change the current __setitem__ to be a special setting method for Arrow backed arrays def set_and_copy

Personally, I think we can hide this from the user to ensure that standard workflows still work with string[pyarrow] as well. As we already do, for example with "boolean setitem":

>>> s = pd.Series(['a', 'b', 'c'], dtype="string[pyarrow]")
>>> s[s == 'a'] = 'b'
>>> s
0    b
1    b
2    c
dtype: string

(ignore for a moment that the specific example above could easily be written with replace as well)

In general, updating an object in place with setitem that way is very much standard pandas code, I think. Maybe we can/should start nudging people more towards methods that do not mutate in place, but then we will first need to develop such APIs, because for a basic setitem, there are no good method alternatives AFAIK.
The above could also be written (somewhat awkwardly) as s.where(s != 'a', 'b'), but that's also not a generic __setitem__ with boolean mask replacement (it has different semantics when setting with array values).

AFAIU, the main issue with "emulating" this mutable behaviour in pandas is that this breaks views (the original discussion here). But personally, I would not prefer to give up the basic use case of mutation (like the example above) because it breaks corner cases with views. There are other ways to deal with this aspect (like "document this difference"), and the general CoW proposal will also help here (because that will ensure this "breaking views" issue is no longer a problem at the Series/DataFrame level, only it would still give a difference in behaviour at the Array level. But that's a much more advanced case compared to DataFrame/Series, and at that level it is much easier to either document this difference or either actually let __setitem__ raise (and emulate the mutability at the Series/DataFrame level)).

@jbrockmendel
Copy link
Member Author

I know on some level that pyarrow isn't making design decisions specifically to raise my blood pressure...

@mroeschke
Copy link
Member

We could provide mutation with an explicit method (something like set_and_copy) while also having __setitem__ raise and direct to use the method to compromise?

@jreback
Copy link
Contributor

jreback commented Apr 10, 2022

we should just implement this
if it copies that's just a fact of life

@jbrockmendel
Copy link
Member Author

if it copies that's just a fact of life

That's fine at the DataFrame level, but at the array level it is asking for trouble.

pyarrow made a choice to make these immutable in order to allow optimizations. great, let's use those optimizations!

@mroeschke is right, we should make an explicit API for this that can be re-used for e.g. SparseArray.

@jreback
Copy link
Contributor

jreback commented Apr 10, 2022

ok that's fair

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Bug ExtensionArray Extending pandas with custom dtypes or arrays. Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

No branches or pull requests

4 participants