-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Assignment of pyarrow arrays yields unexpected dtypes #56994
Comments
take |
Hello @jorisvandenbossche : if I remember well you would have some comments to add to this issue ? |
We talked about this in the newcomers meeting, if I recall correctly @jorisvandenbossche wasn't certain resulting in pyarrow dtypes would necessarily be the right behavior here. |
Apologies for the slow response here. I indeed had the comment that we might not want to use the ArrowDtype in all cases, or at least for me it's not necessarily obvious all the time. Let's take the example of assigning a pyarrow string array, and assume we are in pandas 3.0 that uses a string dtype by default that uses pyarrow under the hood. For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes. |
One other remark is that this is not specific to assignment / For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype: >>> pd.Series(pa.array([1,2,3]))
0 1
1 2
2 3
dtype: int64
>>> pd.Series(pa.array([1,2,3])).dtype
dtype('int64') |
Changing behavior based on whether an alternate dtype exists seems too complex to me. Even if it would be what the user wants in more cases, I think we should value simplicity of "assigning arrow arrays gives you arrow dtypes". This includes string dtypes. |
I think this should give arrow-backed dtypes as well. |
I agree with @rhshadrach for the near term - the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types that are implemented in pandas/NumPy/pyarrow and the subtle differences they may or may not have. While primitive types might be easy to do that with and strings may get easier with PDEP-14, there's also temporal types that I feel like have a lot of potential logic / expectation pitfalls if that isn't really thought out in advance. Long term I agree these should all just map to logical types, but I think we need to agree on and solidify PDEP-13 before we start developing towards that |
IMHO, as a pandas user, I would agree with @rhshadrach and @WillAyd: if a |
@jorisvandenbossche : Hello Joris, hope you're doing well, would you have any news about the previous discussion ? Does the PR #58601 could answer to this problem ? |
While I am fine with not distinguishing arrow types that map to a default dtype and those that do not for assignment, and so always create a column with ArrowDtype when assigning a pyarrow array, I am still not fully convinced this should be done for the Series constructor (although being inconsistent between assignment and Series constructor is also not great):
In #59696 I am adding a Following that same logic, the hypothetical |
I see your point on this but it feels like we are doubling down on a historical design that should be re-evaluated. That makes it way more difficult than it needs to be to use Arrow data types, which I think is what a user would be expecting calling |
The behavior of this method also seems to me to be moving in the opposite direction that we would like long-term - that it provides one more challenge for us to get to a NumPy-nullable / PyArrow world. |
Then we might disagree about the direction we would like long-term. What I want long-term is that the general user does not have to think or care about "arrow" dtypes vs other dtypes. There just happens to be a set of default logical dtypes (that can use arrow under the hood or not, depending on what we decide is the best default, like we decided to use arrow for the string dtype). Let's use the example of the string dtype, where we have both df["col"] = pa.array(["a", "b"])
# or
ser = pd.Series(pa.array(["a", "b"]) Should this give Of course this case is not in isolation, and for other dtypes there is not this default alternative backed by pyarrow. So I understand that it is annoying and confusing to have to decide on a case by case basis. But just considering the case of the string dtype on itself, I find it strange to go with the non-default dtype by default. Maybe part of the disagreement is not the long-term destination, but the road towards it. I personally see ArrowDtype as an experimental dtype (that allows users to test with and already use the native pyarrow integration), but that is not meant to become the default dtype, ever. Similar as the situation for StringDtype, assume we want to make (but I think this all comes back to the fact that we need to try resolve the logical dtypes PDEP discussion ...) |
I agree with this in the long term, but for the scope of this issue I am just thinking about the near term
Definitely a fair question, but I think you are right for the near term that this should return For everything else, we are in a spot where we don't offer a logical type abstraction and tell the user to be explicit about the storage of the type they are using. At the same time, we make it pretty hard to use Arrow storage. The I/O keyword
It just gets super verbose to have to do this. If you wanted a Series of PyArrow data, you'd have to ask for it twice. right? ser = pd.Series(pa.array([1, 2, None], type=pa.int8()), dtype=pd.ArrowDtype(pa.int8())) In some cases (like the date32 example in the OP) it is impossible to assign this data losslessly. Essentially we don't offer a logical type abstraction to encapsulate DATE data, and then we make it really hard to use the physical type
Open to moving some of this discussion back to that, as there is definitely overlap. With that said, I feel like that discussion / implementation is an effort that will take a few years / releases, assuming it comes to fruition. I still think this change in the near term is valuable, given our (admittedly less than ideal) current type system semantics |
To be clear, I think it should be StringDtype, but I was saying that I think that you think it should be ArrowDtype .. :)
So in my reasoning, this is not asking for it twice but just once. You only ask for it by specifying the A bit a similar case, but we currently also don't preserve the numpy dtype if it maps to a default dtype that is not the numpy one. At least that is how current
Yes, that's definitely fair, and indeed we need to have / agree on some workable solution in the meantime. |
While I have been quite strongly stating my preference, I also want to acknowledge that assigning a pyarrow array is not something the average user will typically do, but probably done by users that specifically want to use the ArrowDtype. So in that regard I am fine with going for the convenience of those users, if we are all aware of (and agree on) that ArrowDtype is experimental and behaviour of it can be changed at any time. |
Yea sorry I wasn't very clear. I acknowledge this is a really good point and the argument in the OP might break down here. Since we've put so much work into the 3.0 string type being logical, maybe that's the one exception here where we still coerce to the logical type? Otherwise, things might get confusing with all the new keywords / settings we have been managing
This is also a really good point. Maybe tangential, but perhaps we should read |
Does it mean that it could be ok to try something like:
|
Friendly ping @jorisvandenbossche and @WillAyd |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I am surprised that the pyarrow type is not maintained during assignment
Expected Behavior
Installed Versions
on main
The text was updated successfully, but these errors were encountered: