BUG: Behaviour of sum/mean on sparse boolean arrays changed between 1.5.3 and pandas 2.2 #58015

Open
2 of 3 tasks
CompRhys opened this issue Mar 26, 2024 · 5 comments
Labels: Bug; Dtype Conversions (unexpected or buggy dtype conversions); ExtensionArray (extending pandas with custom dtypes or arrays); Reduction Operations (sum, mean, min, max, etc.)

Comments

@CompRhys

CompRhys commented Mar 26, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'2.2.1'
>>> a = pd.DataFrame(np.random.randint(2, size=(3,4))).astype(pd.SparseDtype(int, fill_value=0))
>>> a
   0  1  2  3
0  0  0  1  0
1  0  1  0  1
2  0  1  1  0
dtype: Sparse[int64, 0]
>>> (a>0).sum(axis=1)
0    True
1    True
2    True
dtype: Sparse[bool, False]
>>> b = pd.DataFrame(np.random.randint(2, size=(3,4)))
>>> (b>0).sum(axis=1)
0    3
1    4
2    2
dtype: int64
>>> # Same steps in a separate session on pandas 1.5.3
>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
'1.5.3'
>>> a = pd.DataFrame(np.random.randint(2, size=(3,4))).astype(pd.SparseDtype(int, fill_value=0))
>>> a
   0  1  2  3
0  1  1  0  0
1  0  0  1  0
2  0  0  1  1
>>> (a>0).sum(axis=1)
0    2
1    1
2    2
dtype: int64
>>> b = pd.DataFrame(np.random.randint(2, size=(3,4)))
>>> (b>0).sum(axis=1)
0    1
1    4
2    1
dtype: int64


### Issue Description

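On pandas 2.2.1, the row-wise sum (axis=1) of a sparse boolean DataFrame is sparse boolean (Sparse[bool, False]) rather than int. On 1.5.3 the same operation returned int64.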

### Expected Behavior

I would expect the sum of a sparse boolean array to be an int in order to match the behavior on a dense array.
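
Until this is resolved, one possible workaround is to densify the boolean mask before reducing. This is only a sketch: it uses a fixed array instead of the random data above so the result is reproducible, and it assumes every column of the mask is sparse, which the `DataFrame.sparse` accessor requires.

```python
import numpy as np
import pandas as pd

# Fixed data (an assumption for reproducibility; the report uses random values).
values = np.array([[0, 0, 1, 0],
                   [0, 1, 0, 1],
                   [0, 1, 1, 0]])
a = pd.DataFrame(values).astype(pd.SparseDtype(int, fill_value=0))

# Comparing a sparse frame yields sparse boolean columns.
mask = a > 0

# Convert the sparse mask to a dense boolean frame, then count True per row.
row_counts = mask.sparse.to_dense().sum(axis=1)
print(row_counts.tolist())
```

Densifying gives up the memory savings of the sparse representation, so this may not suit very large frames.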

### Installed Versions

This issue is observed when upgrading from pandas 1.5.3 to 2.2.1.
@CompRhys added the Bug and Needs Triage labels on Mar 26, 2024
@CompRhys
Author

Potentially this is an edge case of https://pandas.pydata.org/docs/dev/whatsnew/v2.1.0.html#dataframe-reductions-preserve-extension-dtypes. It could also be intended behavior, but it does seem very counter-intuitive to me.
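
For context, the linked 2.1.0 change can be observed with a nullable extension dtype. The snippet below is a minimal illustration of that documented change, not a confirmation that it causes the sparse behavior reported here. The frame and column names are made up for the example.

```python
import pandas as pd

# A small frame backed by the nullable Int64 extension dtype (illustrative data).
df = pd.DataFrame({
    "a": pd.array([1, 1, 2, 1], dtype="Int64"),
    "b": pd.array([None, 2, 3, 4], dtype="Int64"),
})

# Column-wise reduction; missing values are skipped by default.
col_sums = df.sum()

# Per the 2.1.0 release notes, the result keeps the extension dtype from 2.1
# onward, whereas earlier versions returned a NumPy dtype.
print(pd.__version__, col_sums.dtype)
```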

@rhshadrach
Member

Thanks for the report - having the result be Sparse[bool] does look incorrect to me; I would think it should be Sparse[int]. Another change that may have caused this is #54341; a git bisect is needed to tell.

Further investigations and PRs to fix are welcome!
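
For anyone attempting the bisect, a check along the following lines could serve as the test for `git bisect run`, which treats exit code 0 as good and most other codes as bad. This is a sketch under assumptions: the function name `bisect_exit_code` is hypothetical, the data is fixed rather than random, and pandas would likely need to be rebuilt at each bisect step for the check to reflect the checked-out commit.

```python
import numpy as np
import pandas as pd


def bisect_exit_code() -> int:
    """Return 0 if the sparse boolean row sums match the dense ones, else 1."""
    values = np.array([[0, 0, 1, 0],
                       [0, 1, 0, 1],
                       [0, 1, 1, 0]])
    dense = pd.DataFrame(values)
    sparse = dense.astype(pd.SparseDtype(int, fill_value=0))

    expected = np.asarray((dense > 0).sum(axis=1))
    actual = np.asarray((sparse > 0).sum(axis=1))

    # A boolean result (the reported behaviour) cannot equal counts such as 2.
    good = actual.dtype.kind != "b" and np.array_equal(actual, expected)
    return 0 if good else 1


# In a standalone script passed to `git bisect run`, finish with:
#     raise SystemExit(bisect_exit_code())
print(bisect_exit_code())
```

The fixed array deliberately contains rows with more than one nonzero entry, because a row sum of 1 would compare equal to True and hide the problem.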

@rhshadrach added the Reduction Operations, Dtype Conversions, and ExtensionArray labels and removed the Needs Triage label on Mar 26, 2024
@dontgoto
Contributor

take

@dontgoto
Contributor

I would take a look at this issue if that's OK with you, @CompRhys.

@CompRhys
Author

I wouldn't know where to start in the internals, so I would be very grateful if you would like to tackle it, @dontgoto!


3 participants