
API: pd.StringDtype.value_counts should return pd.Int64Dtype #59346

Open
WillAyd opened this issue Jul 29, 2024 · 4 comments
Labels

  • API - Consistency: Internal Consistency of API/Behavior
  • API Design
  • Arrow: pyarrow functionality
  • Dtype Conversions: Unexpected or buggy dtype conversions
  • ExtensionArray: Extending pandas with custom dtypes or arrays.

Comments

WillAyd (Member) commented Jul 29, 2024

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

From discussion in https://github.com/pandas-dev/pandas/pull/59330/files#r1695250192

The fact that value_counts on a pd.StringDtype(storage="pyarrow") column returns int64[pyarrow] is a mistake in our API. While this is well intentioned for performance reasons, it introduces an inconsistency with our extension types.

If we continue to have pd.StringDtype.value_counts return a pd.Int64Dtype(), we can leave it as an implementation detail whether that Int64Dtype is backed by pyarrow or numpy. The user interface would not need to change at all; users could simply install (or not install) pyarrow and things would continue to work the same. Long term, this would also align better with PDEP-13 (#58455).
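
For concreteness, a minimal sketch of the inconsistency described above (illustrative only, assuming pyarrow is installed; the data and variable names are not from the issue):

import pandas as pd

ser = pd.Series(["a", "b", "a"], dtype=pd.StringDtype(storage="pyarrow"))
counts = ser.value_counts()

# Current behavior described in this issue: the counts are pyarrow-backed.
print(counts.dtype)  # int64[pyarrow]

# Proposed behavior: always return the pandas nullable extension dtype,
# regardless of whether the string column is pyarrow- or numpy-backed.
# print(counts.dtype)  # Int64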

Feature Description

Make algorithms for the StringDtype that use pd.NA as the missing value sentinel consistently return other pandas extension types, and leave it as an implementation detail whether those are backed by pyarrow or not.
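
As a user-level illustration only (not the actual implementation), the proposed result dtype can be approximated today with an explicit cast to the nullable extension type:

import pandas as pd

ser = pd.Series(["a", "b", None, "a"], dtype=pd.StringDtype(storage="pyarrow"))

# Hypothetical workaround: cast the result to the nullable pandas extension dtype.
counts = ser.value_counts().astype("Int64")
print(counts.dtype)  # Int64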

Alternative Solutions

status quo

Additional Context

No response

@WillAyd WillAyd added API Design ExtensionArray Extending pandas with custom dtypes or arrays. API - Consistency Internal Consistency of API/Behavior Arrow pyarrow functionality labels Jul 29, 2024
OmarAli-Ibrahim commented

assign it to me

tan-i-ham (Contributor) commented

Hello @WillAyd, I am interested in this task.

I've looked at the code.

My understanding, based on the conversation so far, is that in this test case:

if dtype.na_value is np.nan:
    exp_dtype = "int64"
elif dtype.storage == "pyarrow":
    exp_dtype = "int64[pyarrow]"
else:
    exp_dtype = "Int64"

the check would become the following, and the test would continue, regardless of whether the storage is "pyarrow" or not:

if dtype.na_value is np.nan:
    exp_dtype = "int64"
else:
    exp_dtype = "Int64"

Is the above understanding correct?

If so, could you guide me to where a possible code change should start? I debugged for a while but am not sure where to begin. Would it be in ArrowExtensionArray.value_counts, in value_counts_internal() in pandas/core/algorithms.py, or somewhere else?

WillAyd (Member, Author) commented Aug 10, 2024

Thanks for the interest / questions. I think we should hold off on going too far on this one until the discussion around PDEP-13 (#58455) gets clearer.

Otherwise it is hard to say what should be done here.

@WillAyd WillAyd added the Dtype Conversions Unexpected or buggy dtype conversions label Aug 10, 2024
tan-i-ham (Contributor) commented

Got it. Thanks!
