Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Assignment of pyarrow arrays yields unexpected dtypes #56994

Open
3 tasks done
WillAyd opened this issue Jan 21, 2024 · 11 comments · May be fixed by #58601
Open
3 tasks done

BUG: Assignment of pyarrow arrays yields unexpected dtypes #56994

WillAyd opened this issue Jan 21, 2024 · 11 comments · May be fixed by #58601
Assignees
Labels
Arrow pyarrow functionality Bug Indexing Related to indexing on series/frames, not to indexes themselves pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result

Comments

@WillAyd
Copy link
Member

WillAyd commented Jan 21, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame([[42]], columns=["col"])
df["int16"] = pa.array([16], type=pa.int16())
df["date"] = pa.array([datetime.date(2024, 1, 1)], type=pa.date32())
df["string"] = pa.array(["foo"], pa.string())

Issue Description

>>> df.dtypes
col               int64
int16             int16
date      datetime64[s]
string           object
dtype: object

I am surprised that the pyarrow type is not maintained during assignment

Expected Behavior

>>> df.dtypes
col               int64
int16             int16[pyarrow]
date              date32[pyarrow]
string            string[pyarrow]
dtype: object

Installed Versions

on main

@WillAyd WillAyd added Bug Needs Triage Issue that has not been reviewed by a pandas team member Indexing Related to indexing on series/frames, not to indexes themselves Arrow pyarrow functionality and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 21, 2024
@WillAyd WillAyd changed the title BUG: Assignment of pyarrow string array yields unexpected dtypes BUG: Assignment of pyarrow arrays yields unexpected dtypes Jan 21, 2024
@droussea2001
Copy link
Contributor

take

@droussea2001
Copy link
Contributor

droussea2001 commented May 13, 2024

Hi @WillAyd: I propose in PR #58601 that during a column assignment in a DataFrame, sanitize_array is called with a dtype equals to ArrowDtype(value.type) if the column value is a pa.lib._PandasConvertible (else the standard behaviour is kept)

Would it be acceptable ?

@droussea2001
Copy link
Contributor

Hello @jorisvandenbossche : if I remember well you would have some comments to add to this issue ?

@rhshadrach
Copy link
Member

We talked about this in the newcomers meeting, if I recall correctly @jorisvandenbossche wasn't certain resulting in pyarrow dtypes would necessarily be the right behavior here.

@jorisvandenbossche
Copy link
Member

Apologies for the slow response here. I indeed had the comment that we might not want to use the ArrowDtype in all cases, or at least for me it's not necessarily obvious all the time.

Let's take the example of assigning a pyarrow string array, and assume we are in pandas 3.0 that uses a string dtype by default that uses pyarrow under the hood.
Should at that point assigning a pyarrow string array create ArrowDtype(pa.string()) column or a StringDtype() column? The former might be the "closer" dtype for the data, but the latter is a default dtype (and in general, __setitem__ doesn't allow any options, in which case I think we generally should prefer default dtypes).

For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.

@jorisvandenbossche
Copy link
Member

One other remark is that this is not specific to assignment / __setitem__. If we consider it there, I think we should consider it for all places were we accept array-like input data and convert that to a pandas array-like (series/index/dataframe column), to ensure such input is treated consistently?

For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype:

>>> pd.Series(pa.array([1,2,3]))
0    1
1    2
2    3
dtype: int64
>>> pd.Series(pa.array([1,2,3])).dtype
dtype('int64')

@rhshadrach
Copy link
Member

rhshadrach commented Jun 19, 2024

For Arrow types which don't have an equivalent pandas type, I think preserving the dtype definitely makes sense (eg for nested types, or decimal, etc). But for Arrow types for which we have an equivalent default dtype (and for which the conversion can often be zero-copy), it's not entirely obvious to me that we shouldn't prefer using the default dtypes.

Changing behavior based on whether an alternate dtype exists seems too complex to me. Even if it would be what the user wants in more cases, I think we should value simplicity of "assigning arrow arrays gives you arrow dtypes". This includes string dtypes.

@rhshadrach
Copy link
Member

For example, the Series constructor accepts a pyarrow array as well, but currently also coerces it to the default (numpy) dtype

I think this should give arrow-backed dtypes as well.

@WillAyd
Copy link
Member Author

WillAyd commented Jun 19, 2024

I agree with @rhshadrach for the near term - the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types that are implemented in pandas/NumPy/pyarrow and the subtle differences they may or may not have. While primitive types might be easy to do that with and strings may get easier with PDEP-14, there's also temporal types that I feel like have a lot of potential logic / expectation pitfalls if that isn't really thought out in advance.

Long term I agree these should all just map to logical types, but I think we need to agree on and solidify PDEP-13 before we start developing towards that

@droussea2001
Copy link
Contributor

droussea2001 commented Jun 20, 2024

the rule "you get pyarrow types unless we already have something else" burdens end users with knowing all of the types

IMHO, as a pandas user, I would agree with @rhshadrach and @WillAyd: if a Series or a DataFrame is created or assigned with data dtype 'X' I would expect (or at least it seems to me clearer), if it is possible, that it keeps the same 'X' dtype (even if I'm agree equally with the long term approach).

@droussea2001
Copy link
Contributor

@jorisvandenbossche : Hello Joris, hope you're doing well, would you have any news about the previous discussion ? Does the PR #58601 could answer to this problem ?

@WillAyd WillAyd added the pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result label Aug 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Bug Indexing Related to indexing on series/frames, not to indexes themselves pyarrow dtype retention op with pyarrow dtype -> expect pyarrow result
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants