
BUG: Assignment of pyarrow arrays yields unexpected dtypes #56994

WillAyd opened this issue Jan 21, 2024 · 21 comments · May be fixed by #58601
Labels: Arrow (pyarrow functionality) · Bug · Indexing (related to indexing on series/frames, not to indexes themselves) · pyarrow dtype retention (op with pyarrow dtype -> expect pyarrow result)

Comments

@WillAyd (Member) commented Jan 21, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([[42]], columns=["col"])
df["int16"] = pa.array([16], type=pa.int16())
df["date"] = pa.array([datetime.date(2024, 1, 1)], type=pa.date32())
df["string"] = pa.array(["foo"], pa.string())

Issue Description

>>> df.dtypes
col               int64
int16             int16
date      datetime64[s]
string           object
dtype: object

I am surprised that the pyarrow type is not preserved during assignment.

Expected Behavior

>>> df.dtypes
col               int64
int16             int16[pyarrow]
date              date32[pyarrow]
string            string[pyarrow]
dtype: object

Installed Versions

on main

@WillAyd WillAyd added Bug Needs Triage Issue that has not been reviewed by a pandas team member Indexing Related to indexing on series/frames, not to indexes themselves Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 21, 2024
@WillAyd WillAyd changed the title BUG: Assignment of pyarrow string array yields unexpected dtypes BUG: Assignment of pyarrow arrays yields unexpected dtypes Jan 21, 2024
@droussea2001 (Contributor)

take

@droussea2001 (Contributor) commented May 13, 2024

Hi @WillAyd: in PR #58601 I propose that, during column assignment in a DataFrame, sanitize_array is called with a dtype equal to ArrowDtype(value.type) if the assigned value is a pa.lib._PandasConvertible (otherwise the standard behaviour is kept).

Would that be acceptable?

@droussea2001 (Contributor)

Hello @jorisvandenbossche: if I remember correctly, you had some comments to add to this issue?

@rhshadrach (Member)

We talked about this in the newcomers meeting; if I recall correctly, @jorisvandenbossche wasn't certain that producing pyarrow dtypes would necessarily be the right behavior here.

@jorisvandenbossche (Member)

Apologies for the slow response here. I indeed had the comment that we might not want to use the ArrowDtype in all cases, or at least for me it's not necessarily obvious all the time.

Let's take the example of assigning a pyarrow string array, and assume we are in pandas 3.0, which by default uses a string dtype backed by pyarrow under the hood.
Should assigning a pyarrow string array at that point create an ArrowDtype(pa.string()) column or a StringDtype() column? The former might be the "closer" dtype for the data, but the latter is a default dtype (and in general, __setitem__ doesn't allow any options, in which case I think we should generally prefer default dtypes).

For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.

@jorisvandenbossche (Member)

One other remark is that this is not specific to assignment / __setitem__. If we consider it there, I think we should consider it for all places where we accept array-like input data and convert that to a pandas array-like (series/index/dataframe column), to ensure such input is treated consistently?

For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype:

>>> pd.Series(pa.array([1,2,3]))
0    1
1    2
2    3
dtype: int64
>>> pd.Series(pa.array([1,2,3])).dtype
dtype('int64')

@rhshadrach (Member) commented Jun 19, 2024

For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.

Changing behavior based on whether an alternate dtype exists seems too complex to me. Even if it would be what the user wants in more cases, I think we should value simplicity of "assigning arrow arrays gives you arrow dtypes". This includes string dtypes.

@rhshadrach (Member)

For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype

I think this should give arrow-backed dtypes as well.

@WillAyd (Member, Author) commented Jun 19, 2024

I agree with @rhshadrach for the near term - the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types that are implemented in pandas/NumPy/pyarrow and the subtle differences they may or may not have. While primitive types might be easy to do that with, and strings may get easier with PDEP-14, there are also temporal types that I feel have a lot of potential logic / expectation pitfalls if that isn't really thought out in advance.

Long term I agree these should all just map to logical types, but I think we need to agree on and solidify PDEP-13 before we start developing towards that

@droussea2001 (Contributor) commented Jun 20, 2024

the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types

IMHO, as a pandas user, I agree with @rhshadrach and @WillAyd: if a Series or a DataFrame is created or assigned from data with dtype 'X', I would expect (or at least it seems clearer to me) that it keeps the same 'X' dtype where possible (even though I equally agree with the long-term approach).

@droussea2001 (Contributor)

@jorisvandenbossche: Hello Joris, hope you're doing well. Do you have any news about the previous discussion? Could PR #58601 answer this problem?

@WillAyd WillAyd added the pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result label Aug 17, 2024
@jorisvandenbossche (Member)

While I am fine with not distinguishing, for assignment, between arrow types that map to a default dtype and those that do not, and so always creating a column with ArrowDtype when assigning a pyarrow array, I am still not fully convinced this should be done for the Series constructor (although being inconsistent between assignment and the Series constructor is also not great):

For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype

I think this should give arrow-backed dtypes as well.

In #59696 I am adding a DataFrame.from_arrow() constructor method (I need to update that PR, and reviews also very welcome). That is currently only for DataFrame, but probably we should add a Series.from_arrow as follow-up as well. And at the moment, this method also converts the Arrow types to "default" dtypes on the pandas side (essentially following pyarrow's to_pandas' defaults).

Following that same logic, the hypothetical Series.from_arrow(pyarrow_array) and Series(pyarrow_array) would return a different result, which would also be strange? Or should we consider changing the behaviour of from_arrow()? (but given this method is meant as a generic constructor of a pandas dataframe/series from any arrow-compatible source, I personally think this should use default dtypes by default)

@WillAyd (Member, Author) commented Nov 4, 2024

And at the moment, this method also converts the Arrow types to "default" dtypes on the pandas side (essentially following pyarrow's to_pandas' defaults).

I see your point, but it feels like we are doubling down on a historical design that should be re-evaluated. That makes it far more difficult than it needs to be to use Arrow data types, which I think is what a user would expect when calling from_arrow or assigning an arrow array.

@rhshadrach (Member) commented Nov 5, 2024

And at the moment, this method also converts the Arrow types to "default" dtypes on the pandas side (essentially following pyarrow's to_pandas' defaults).

The behavior of this method also seems to me to be moving in the opposite direction from what we would like long-term - it provides one more challenge for us to get to a NumPy-nullable / PyArrow world.

@jorisvandenbossche (Member)

Then we might disagree about the direction we would like long-term. What I want long-term is that the general user does not have to think or care about "arrow" dtypes vs other dtypes. There just happens to be a set of default logical dtypes (that can use arrow under the hood or not, depending on what we decide is the best default, like we decided to use arrow for the string dtype).
And at that point, I would expect that creating a Series or column (through assignment) will use the default dtype, unless the user explicitly specifies the pandas dtype (i.e. Series(..., dtype=<explicit dtype instance>)).

Let's use the example of the string dtype, where we have both StringDtype("pyarrow") and ArrowDtype(pa.string()), both backed by a pyarrow string array. Then when a user does:

df["col"] = pa.array(["a", "b"])
# or 
ser = pd.Series(pa.array(["a", "b"]))

Should this give StringDtype("pyarrow") or ArrowDtype(pa.string())?
My understanding is that you are arguing for using ArrowDtype(pa.string()). But why would we (for this specific case) prefer the experimental ArrowDtype over the default (since this change is for 3.0) StringDtype, while both are backed by pyarrow?

Of course this case is not in isolation, and for other dtypes there is not this default alternative backed by pyarrow. So I understand that it is annoying and confusing to have to decide on a case by case basis. But just considering the case of the string dtype on itself, I find it strange to go with the non-default dtype by default.

Maybe part of the disagreement is not the long-term destination, but the road towards it. I personally see ArrowDtype as an experimental dtype (that allows users to test with and already use the native pyarrow integration), but that is not meant to become the default dtype, ever. Similar to the situation for StringDtype: assume we want to make ArrowDtype(pa.int64()) the default integer dtype at some point, then I expect that we will create a pd.Int64Dtype(storage="pyarrow") to become that default dtype, and not the ArrowDtype itself (while under the hood using the ArrowEA implementation). And in that light, I am hesitant to construct pandas data with ArrowDtype by default (when it was not explicitly requested through some keyword).

(but I think this all comes back to the fact that we need to try resolve the logical dtypes PDEP discussion ...)

@WillAyd (Member, Author) commented Nov 5, 2024

What I want long-term is that the general user does not have to think or care about "arrow" dtypes vs other dtypes.

I agree with this in the long term, but for the scope of this issue I am just thinking about the near term

My understanding is that you are arguing for using ArrowDtype(pa.string()). But why would we (for this specific case) prefer the experimental ArrowDtype over the default (since this change is for 3.0) StringDtype, while both are backed by pyarrow?

Definitely a fair question, but I think you are right for the near term that this should return ArrowDtype(pa.string()). OR, if we make an exception for anything, we can continue to treat strings as our sole logical type, since we have kind of developed it that way already.

For everything else, we are in a spot where we don't offer a logical type abstraction and tell the user to be explicit about the storage of the type they are using. At the same time, we make it pretty hard to use Arrow storage. The I/O keyword dtype_backend="pyarrow" will return arrow storage in a lot of cases (I think?), but direct assignment of a pyarrow array ends up converting to a NumPy array. I can't imagine the latter behavior being desirable at all.

And in that light, I am hesitant to construct pandas data with ArrowDtype by default (when it was not explicitly requested through some keyword).

It just gets super verbose to have to do this. If you wanted a Series of PyArrow data, you'd have to ask for it twice, right?

ser = pd.Series(pa.array([1, 2, None], type=pa.int8()), dtype=pd.ArrowDtype(pa.int8()))

In some cases (like the date32 example in the OP) it is impossible to assign this data losslessly. Essentially we don't offer a logical type abstraction to encapsulate DATE data, and then we make it really hard to use the physical type

(but I think this all comes back to the fact that we need to try resolve the logical dtypes PDEP discussion ...)

Open to moving some of this discussion back to that, as there is definitely overlap. With that said, I feel like that discussion / implementation is an effort that will take a few years / releases, assuming it comes to fruition. I still think this change in the near term is valuable, given our (admittedly less than ideal) current type system semantics

@jorisvandenbossche (Member)

but I think you are right for the near term that this should return ArrowDtype(pa.string()),

To be clear, I think it should be StringDtype, but I was saying that I think that you think it should be ArrowDtype .. :)

It just gets super verbose to have to do this. If you wanted a Series of PyArrow data, you'd have to ask for it twice. right?

ser = pd.Series(pa.array([1, 2, None], type=pa.int8()), dtype=pd.ArrowDtype(pa.int8()))

So in my reasoning, this is not asking for it twice but just once. You only ask for it by specifying the dtype, and not by passing a pyarrow array.
I know that a pyarrow array is a bit of a special case because we use pyarrow explicitly for the implementation of the ArrowDtype. But what if you pass another arrow-compatible array object (that implements __arrow_c_array__, which we should start recognizing), like a polars Series or arro3 or nanoarrow array?
I personally think we should see a pyarrow array as "just" another array-like we accept as input, and that we coerce to our internal (default) representation.

A somewhat similar case: we currently also don't preserve the numpy dtype if it maps to a default dtype that is not the numpy one. At least that is how current main behaves for StringDtype when assigning numpy string arrays.
Also for this case (constructing from or assigning numpy arrays) we will have to decide what to do in the future when adding more default dtypes based on pyarrow (e.g. assume the nullable Int64 becomes the default integer dtype, then assigning a numpy int64 array will also convert that to a masked IntegerArray).

With that said, I feel like that discussion / implementation (logical dtypes PDEP) is an effort that will take a few years / releases, assuming it comes to fruition. I still think this change in the near term is valuable, given our (admittedly less than ideal) current type system semantics

Yes, that's definitely fair, and indeed we need to have / agree on some workable solution in the meantime.

@jorisvandenbossche (Member)

While I have been stating my preference quite strongly, I also want to acknowledge that assigning a pyarrow array is not something the average user will typically do; it is probably done by users who specifically want to use the ArrowDtype. So in that regard I am fine with going for the convenience of those users, if we are all aware of (and agree) that ArrowDtype is experimental and its behaviour can be changed at any time.
(mostly to say, I don't want that we are tied to creating ArrowDtype when assigning a pyarrow array, when we would move forward with better logical types. At that point we should feel free to break this to use those new dtypes)

@WillAyd (Member, Author) commented Nov 5, 2024

To be clear, I think it should be StringDtype, but I was saying that I think that you think it should be ArrowDtype .. :)

Yea sorry I wasn't very clear. I acknowledge this is a really good point and the argument in the OP might break down here.

Since we've put so much work into the 3.0 string type being logical, maybe that's the one exception here where we still coerce to the logical type? Otherwise, things might get confusing with all the new keywords / settings we have been managing

But what if you pass another arrow-compatible array object (that implements __arrow_c_array__, which we should start recognizing), like a polars Series or arro3 or nanoarrow array?

This is also a really good point. Maybe tangential, but perhaps we should read __arrow_c_stream__ or __arrow_c_schema__ in that case to determine the type? I think that's an implementation detail we can work out

@droussea2001 (Contributor)

Does that mean it could be OK to try something like:

  • Creating a function called, for example, is_arrow_compatible_array (in lib.pyx)
  • Checking in this function whether the array passed in exposes a valid __arrow_c_stream__ or __arrow_c_schema__ method (as in test_series_arrow_interface(), for example)
  • Replacing the call to is_pyarrow_array with is_arrow_compatible_array?
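A minimal sketch of what such a helper could look like (the name is_arrow_compatible_array comes from the comment above; the duck-typed check on the Arrow PyCapsule interface is an assumption for illustration, not actual pandas internals):

```python
def is_arrow_compatible_array(obj) -> bool:
    # Duck-typed check for the Arrow PyCapsule interface: any object that
    # exports __arrow_c_array__ or __arrow_c_stream__ (pyarrow arrays,
    # polars Series, nanoarrow arrays, ...) would be recognized, rather
    # than only pyarrow.Array instances.
    return hasattr(obj, "__arrow_c_array__") or hasattr(obj, "__arrow_c_stream__")
```

Such a check would also cover the polars / arro3 / nanoarrow cases mentioned earlier in the thread.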

@droussea2001 (Contributor)

Friendly ping @jorisvandenbossche and @WillAyd
