BUG: Groupby-aggregate on a boolean column returns a different datatype with pyarrow than with numpy #53030
Comments
Thanks for the report. Confirmed on main, further investigations and PRs to fix are welcome!
Thanks for the quick response. I'm not familiar enough with the pandas code base (and in particular with whatever's going on with Arrow) to pursue this further, but it does seem like it has potential to surprise a fair number of users. This kind of aggregation is not uncommon.
take
So this ends up here we hit the
@mroeschke not sure how to fix this.. so unassigned myself.. Sorry for the inconvenience...
This is another good issue to track for PDEP-13 #58455
@WillAyd - Would I be right to assume that applies to any issue tagged with
Yea I think many of that tag and the
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
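A minimal sketch of the comparison this report describes, not the reporter's original snippet. It assumes a small hypothetical frame with a `category` key and a boolean `flag` column (both names and all values are invented for illustration) and prints the result dtypes of two aggregations for each backend. The pyarrow-backed dtypes are printed rather than asserted, since they may differ between pandas versions.

```python
import pandas as pd

# Hypothetical data for illustration only; not taken from the report.
df = pd.DataFrame(
    {
        "category": ["A", "A", "B", "B"],
        "flag": [True, True, False, True],
    }
)


def aggregate_flag(frame):
    """Group by 'category' and return the sum and mean of 'flag'."""
    grouped = frame.groupby("category")["flag"]
    return {"sum": grouped.sum(), "mean": grouped.mean()}


# NumPy-backed boolean column.
numpy_result = aggregate_flag(df)
print("numpy  :", {name: str(s.dtype) for name, s in numpy_result.items()})

# pyarrow-backed boolean column (skipped if pyarrow is not installed).
try:
    import pyarrow  # noqa: F401

    arrow_df = df.astype({"flag": "bool[pyarrow]"})
    arrow_result = aggregate_flag(arrow_df)
    print("pyarrow:", {name: str(s.dtype) for name, s in arrow_result.items()})
except ImportError:
    print("pyarrow is not installed; skipping the pyarrow comparison")
```

If the two printed lines show different dtypes for the same aggregation, the behavior described in this issue is present in the installed pandas version.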
Issue Description
Doing a groupby and aggregation on a bool[pyarrow] column returns a different datatype than the same operation on a numpy bool column. In particular, it seems to always return another bool[pyarrow] regardless of the aggregation performed.
Expected Behavior
I would expect the same datatype and results to be returned regardless of the backend chosen. Specifically, I would expect the result for category 'A' to be the same as the result of the following calculation, which is the same regardless of backend:
OR, if this is the intended behavior, I would expect this change to be prominently displayed in the groupby documentation.
Installed Versions
pandas : 2.0.1
numpy : 1.23.5
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 57.5.0
pip : 23.0.1
Cython : 0.29.33
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.10.0
pandas_datareader: None
bs4 : 4.11.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : 3.7.0
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.10.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : 2023.1.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : 2.3.0
pyqt5 : None