-
-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Assignment of pyarrow arrays yields unexpected dtypes #56994
Comments
take |
Hello @jorisvandenbossche : if I remember well you would have some comments to add to this issue ? |
We talked about this in the newcomers meeting, if I recall correctly @jorisvandenbossche wasn't certain resulting in pyarrow dtypes would necessarily be the right behavior here. |
Apologies for the slow response here. I indeed had the comment that we might not want to use the ArrowDtype in all cases, or at least for me it's not necessarily obvious all the time. Let's take the example of assigning a pyarrow string array, and assume we are in pandas 3.0 that uses a string dtype by default that uses pyarrow under the hood. For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes. |
One other remark is that this is not specific to assignment / For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype: >>> pd.Series(pa.array([1,2,3]))
0 1
1 2
2 3
dtype: int64
>>> pd.Series(pa.array([1,2,3])).dtype
dtype('int64') |
Changing behavior based on whether an alternate dtype exists seems too complex to me. Even if it would be what the user wants in more cases, I think we should value simplicity of "assigning arrow arrays gives you arrow dtypes". This includes string dtypes. |
I think this should give arrow-backed dtypes as well. |
I agree with @rhshadrach for the near term - the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types that are implemented in pandas/NumPy/pyarrow and the subtle differences they may or may not have. While primitive types might be easy to do that with and strings may get easier with PDEP-14, there's also temporal types that I feel like have a lot of potential logic / expectation pitfalls if that isn't really thought out in advance. Long term I agree these should all just map to logical types, but I think we need to agree on and solidify PDEP-13 before we start developing towards that |
IMHO, as a pandas user, I would agree with @rhshadrach and @WillAyd: if a |
@jorisvandenbossche : Hello Joris, hope you're doing well, would you have any news about the previous discussion ? Does the PR #58601 could answer to this problem ? |
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
I am surprised that the pyarrow type is not maintained during assignment
Expected Behavior
Installed Versions
on main
The text was updated successfully, but these errors were encountered: