BUG: inconsistent indices in GroupByRolling when selecting or not selecting subset of columns #59567
Comments
Thanks for the report - confirmed on main. Further investigations and PRs to fix are welcome!
@rhshadrach do we simply drop the DataFrame's current index like in this case? (Note that the integer index is gone)
@snitish In the long term, I think this is part of #51751. There I'm supportive of treating this as a groupby-transform (as opposed to the OP's desired behavior), but that may be a difficult change that needs to be carefully evaluated and navigated. However, the additional surprise here is that subsetting the columns produces a different index. I think it'd be good to fix that narrowly. I'm supportive of adding the index from the input DataFrame. This is the behavior that seems to be documented in both: Additionally, this is also the behavior when using e.g.
cc @mroeschke
@rhshadrach so the two separate issues that need to be addressed are -
You're suggesting we only address no. 1 at the moment, is that right?
Just to note that my main issue here, and the reason I opened it, is, as you pointed out, the inconsistency between subsetting and not subsetting. Whether the intended behavior is a transform or an agg is less important to me as long as it's consistent in both cases :) Thanks for taking a look at this!
Correct, but that @mroeschke purposefully introduced this behavior gives me pause. Would like his thoughts before going further on this.
I think we should not be doing so, but certainly open for discussion.
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
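A minimal sketch of the comparison this report describes, using made-up data. The column names `id`, `area`, `datetime` and `predictions` come from the issue text; the values, the `"2D"` window and the `mean` aggregation are assumptions chosen only for illustration. The first call rolls over the full dataframe, the second selects `predictions` explicitly, and the printed index names and columns can then be compared.

```python
import pandas as pd

# Hypothetical data: names follow the issue text, values are invented.
df = pd.DataFrame(
    {
        "id": [1, 1, 1, 2, 2, 2],
        "area": ["a", "a", "a", "b", "b", "b"],
        "datetime": pd.date_range("2024-01-01", periods=6, freq="D"),
        "predictions": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    }
)

# Group by id and area, then roll over a time window on the datetime column.
rolling = df.groupby(["id", "area"]).rolling("2D", on="datetime")

# First case: aggregate without selecting any columns.
full = rolling.mean()

# Second case: explicitly select the single remaining column.
subset = rolling[["predictions"]].mean()

# Inspect how the two results are indexed and which columns they carry.
print(list(full.index.names), list(full.columns))
print(list(subset.index.names), list(subset.columns))
```

Comparing the two printed lines shows whether the index levels and the placement of `datetime` agree between the two calls on a given pandas version.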
Issue Description
The only difference is that in the second case I am explicitly selecting a single column [[predictions]] whereas in the first example I am calling it on the full dataframe. This shouldn't make a difference as the dataframe only contains the predictions column outside of the columns used to group and roll on.
This difference causes two issues in the dataframe where I don't select a subset of the columns:
Expected Behavior
I would expect both cases to behave the way the second example does, with id, area, datetime as the index levels.
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1066-azure
Version : #75-Ubuntu SMP Thu May 30 14:29:45 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 2.2.2
numpy : 2.1.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : None
pip : None
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None