
BUG: groupby.tshift inconsistent behavior with other groupby transformations  #34452

Closed
@fujiaxiang

Description

I discovered this while trying to tackle issue #32344, where @ryankarlos mentioned groupby.transform('tshift', ...) seems to behave incorrectly.

However, before we can address #32344, we probably need to address this.

# on current master
>>> import pandas as pd
>>> import numpy as np

>>> pd.__version__
'1.1.0.dev0+1708.g043b60920'

>>> df = pd.DataFrame(
...     {
...     "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
...     "B": [1, 2, np.nan, 3, 3, np.nan, 4],
...     },
...     index=pd.date_range('2020-01-01', '2020-01-07')
... )
>>> df
              A    B
2020-01-01  foo  1.0
2020-01-02  foo  2.0
2020-01-03  foo  NaN
2020-01-04  foo  3.0
2020-01-05  bar  3.0
2020-01-06  bar  NaN
2020-01-07  baz  4.0

>>> df.groupby("A").tshift(1, "D")
                  B
A
bar 2020-01-06  3.0
    2020-01-07  NaN
baz 2020-01-08  4.0
foo 2020-01-02  1.0
    2020-01-03  2.0
    2020-01-04  NaN
    2020-01-05  3.0

>>> df.groupby("A").ffill()
              B
2020-01-01  1.0
2020-01-02  2.0
2020-01-03  2.0
2020-01-04  3.0
2020-01-05  3.0
2020-01-06  3.0
2020-01-07  4.0

>>> df.groupby("A").cumsum()
              B
2020-01-01  1.0
2020-01-02  3.0
2020-01-03  NaN
2020-01-04  6.0
2020-01-05  3.0
2020-01-06  NaN
2020-01-07  4.0

We can see that groupby.tshift is inconsistent with other groupby transformations. It retains the group labels (as an extra index level) and, more importantly, reorders the data by group.

Since 0.25 there has been a deliberate effort to make all groupby transformations consistent; see https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#dataframe-groupby-ffill-bfill-no-longer-return-group-labels
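As a rough illustration of what "consistent" means here, the sketch below checks that ffill and cumsum return a result that keeps the original index and order and drops the group labels. The loop and the two checks are my own assumptions about how to state that property, not a test taken from pandas.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
    },
    index=pd.date_range("2020-01-01", "2020-01-07"),
)

for name in ("ffill", "cumsum"):
    result = getattr(df.groupby("A"), name)()
    # Same index, same order as the input: no group level, no reordering.
    assert result.index.equals(df.index)
    # The grouping column "A" is not part of the result.
    assert list(result.columns) == ["B"]
```

The groupby.tshift output shown earlier would fail both checks.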

Following this reasoning, I would expect the returned data to look more like:

>>> df.groupby("A").tshift(1, "D")  # this is actually the result of df.tshift(1, "D").drop(columns='A')
              B
2020-01-02  1.0
2020-01-03  2.0
2020-01-04  NaN
2020-01-05  3.0
2020-01-06  3.0
2020-01-07  NaN
2020-01-08  4.0

However, if we make groupby.tshift consistent with the other groupby transformations as above, it becomes no different from df.tshift(1, "D").drop(columns='A'), and the groupby loses its meaning here.
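A minimal sketch of why the grouping adds nothing: shifting the index by a fixed offset moves every row by the same amount whichever group it belongs to. The sketch uses shift(freq=...) as a stand-in for tshift, on the assumption that the two shift the index the same way; the names per_group and whole_frame are purely illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
    },
    index=pd.date_range("2020-01-01", "2020-01-07"),
)

# Shift each group's index by one day separately, then restore date order.
per_group = pd.concat(
    [group[["B"]].shift(1, freq="D") for _, group in df.groupby("A")]
).sort_index()

# Shift the whole frame at once, ignoring the groups entirely.
whole_frame = df.shift(1, freq="D").drop(columns="A")

# The two results hold the same values on the same dates
# (the index freq attribute is ignored in the comparison).
pd.testing.assert_frame_equal(per_group, whole_frame, check_freq=False)
```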

Perhaps we should just deprecate groupby.tshift entirely? I know #11631 discussed deprecating tshift, but that has been stalled for a long time.


Labels: Deprecate (Functionality to remove in pandas), Groupby
