Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: inconsistent indices in GroupByRolling when selecting or not selecting subset of columns #59567

Open
2 of 3 tasks
inigohidalgo opened this issue Aug 20, 2024 · 7 comments
Labels
API - Consistency Internal Consistency of API/Behavior Bug Groupby Window rolling, ewma, expanding

Comments

@inigohidalgo
Copy link

inigohidalgo commented Aug 20, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np
datetime_column = "datetime"
datetime_series = pd.date_range(start="2020-01-01", periods=10, freq="D")
datetime_series = datetime_series.append(datetime_series)

predictions = pd.DataFrame(
    {
        datetime_column: datetime_series,
        "prediction": np.random.rand(len(datetime_series)),
        "id": np.repeat(["A", "B"], 10),
        "area": np.repeat(["fr", "fr", "de", "de", "fr"], 4),
    }
)
print(predictions.groupby(["id", "area"]).rolling("7d", on="datetime").max())
             datetime  prediction
id area                          
A  de   8  2020-01-09    0.768346
        9  2020-01-10    0.768346
   fr   0  2020-01-01    0.159567
        1  2020-01-02    0.722039
        2  2020-01-03    0.722039
        3  2020-01-04    0.922641
        4  2020-01-05    0.922641
        5  2020-01-06    0.922641
        6  2020-01-07    0.922641
        7  2020-01-08    0.922641
B  de   10 2020-01-01    0.158251
        11 2020-01-02    0.814331
        12 2020-01-03    0.814331
        13 2020-01-04    0.814331
        14 2020-01-05    0.814331
        15 2020-01-06    0.943016
   fr   16 2020-01-07    0.975385
        17 2020-01-08    0.975385
        18 2020-01-09    0.975385
        19 2020-01-10    0.975385
print(predictions.groupby(["id", "area"]).rolling("7d", on="datetime")[["prediction"]].max())
                    prediction
id area datetime              
A  de   2020-01-09    0.768346
        2020-01-10    0.768346
   fr   2020-01-01    0.159567
        2020-01-02    0.722039
        2020-01-03    0.722039
        2020-01-04    0.922641
        2020-01-05    0.922641
        2020-01-06    0.922641
        2020-01-07    0.922641
        2020-01-08    0.922641
B  de   2020-01-01    0.158251
        2020-01-02    0.814331
        2020-01-03    0.814331
        2020-01-04    0.814331
        2020-01-05    0.814331
        2020-01-06    0.943016
   fr   2020-01-07    0.975385
        2020-01-08    0.975385
        2020-01-09    0.975385
        2020-01-10    0.975385

Issue Description

The only difference is that in the second case I am explicitly selecting a single column [[predictions]] whereas in the first example I am calling it on the full dataframe. This shouldn't make a difference as the dataframe only contains the predictions column outside of the columns used to group and roll on.

This difference causes two issues in the dataframe where I don't select a subset of the columns:

  1. The old index is appended as an additional unnamed level
  2. The datetime column is kept as a column instead of as an index level

Expected Behavior

I would expect both cases to behave the way the second example does, with id, area, datetime as the index levels.

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.10.13.final.0
python-bits : 64
OS : Linux
OS-release : 5.15.0-1066-azure
Version : #75-Ubuntu SMP Thu May 30 14:29:45 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 2.1.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : None
pip : None
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : 8.26.0
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None

@inigohidalgo inigohidalgo added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 20, 2024
@rhshadrach
Copy link
Member

Thanks for the report - confirmed on main. Further investigations and PRs to fix are welcome!

@rhshadrach rhshadrach added Groupby Window rolling, ewma, expanding and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Aug 25, 2024
@snitish
Copy link
Contributor

snitish commented Dec 8, 2024

@rhshadrach do we simply drop the DataFrame's current index like in this case? (Note that the integer index is gone)

>>> print(predictions.groupby(["id", "area"]).rolling("7d", on="datetime")[["prediction"]].max())
                    prediction
id area datetime              
A  de   2020-01-09    0.768346
        2020-01-10    0.768346
   fr   2020-01-01    0.159567
        2020-01-02    0.722039
        2020-01-03    0.722039
        2020-01-04    0.922641
        2020-01-05    0.922641
        2020-01-06    0.922641
        2020-01-07    0.922641
        2020-01-08    0.922641
B  de   2020-01-01    0.158251
        2020-01-02    0.814331
        2020-01-03    0.814331
        2020-01-04    0.814331
        2020-01-05    0.814331
        2020-01-06    0.943016
   fr   2020-01-07    0.975385
        2020-01-08    0.975385
        2020-01-09    0.975385
        2020-01-10    0.975385

@rhshadrach
Copy link
Member

@snitish In the long-term, I think this is part of #51751. There I'm supportive of treating this as a groupby-transform (as opposed to the OP's desired behavior), but that may be a difficult change that needs to be carefully evaluated and navigated.

However the additional surprise here is that subsetting the columns produces a different index. I think it'd be good to fix that narrowly. I'm supportive of adding the index from the input DataFrame. This is the behavior that seems to be documented in both:

Additionally, this is also the behavior when using e.g. .rolling(3) instead of .rolling(on=...).

@rhshadrach rhshadrach added the API - Consistency Internal Consistency of API/Behavior label Dec 9, 2024
@rhshadrach
Copy link
Member

cc @mroeschke

@snitish
Copy link
Contributor

snitish commented Dec 10, 2024

@rhshadrach so the two separate issues that need to be addressed are -

  1. Subsetting the groupby to specific columns produces a different index (seems due to BUG: rolling.groupby with on and __getitem__ doesn't mutate underlying object #43374)
  2. The open question of whether the 'on' column needs to be added as a new level to the multiindex

You're suggesting we only address no. 1 at the moment, is that right?

@inigohidalgo
Copy link
Author

inigohidalgo commented Dec 10, 2024

@rhshadrach @snitish

I'm supportive of treating this as a groupby-transform (as opposed to the OP's desired behavior)

Just to note that my main issue here and the reason I opened it is, as you pointed out, the inconsistency between subsetting and not subsetting. The intended behavior being a transform vs agg is less important to me as long as it's consistent in both cases :)

Thanks for taking a look at this!

@rhshadrach
Copy link
Member

You're suggesting we only address no. 1 at the moment, is that right?

Correct, but that @mroeschke purposefully introduced this behavior gives me pause. Would like his thoughts before going further on this.

2. The open question of whether the 'on' column needs to be added as a new level to the multiindex

I think we should not be doing so, but certainly open for discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API - Consistency Internal Consistency of API/Behavior Bug Groupby Window rolling, ewma, expanding
Projects
None yet
Development

No branches or pull requests

3 participants