BUG: groupby ffill adds labels as extra column (#21521) #26162

adbull · 2019-04-20T17:10:04Z

closes BUG: groupby().ffill() adds group labels as extra column #21521
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

codecov · 2019-04-20T20:54:20Z

Codecov Report

Merging #26162 into master will decrease coverage by 51.23%.
The diff coverage is n/a.

@@             Coverage Diff             @@
##           master   #26162       +/-   ##
===========================================
- Coverage   91.98%   40.75%   -51.24%     
===========================================
  Files         175      175               
  Lines       52377    52382        +5     
===========================================
- Hits        48180    21346    -26834     
- Misses       4197    31036    +26839

Flag	Coverage Δ
#multiple	`?`
#single	`40.75% <ø> (-0.12%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/generic.py	`13.14% <ø> (-75.89%)`	⬇️
pandas/io/formats/latex.py	`0% <0%> (-100%)`	⬇️
pandas/io/sas/sas_constants.py	`0% <0%> (-100%)`	⬇️
pandas/core/groupby/categorical.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/plotting.py	`0% <0%> (-100%)`	⬇️
pandas/tseries/converter.py	`0% <0%> (-100%)`	⬇️
pandas/io/formats/html.py	`0% <0%> (-99.37%)`	⬇️
pandas/io/sas/sas7bdat.py	`0% <0%> (-91.16%)`	⬇️
pandas/io/sas/sas_xport.py	`0% <0%> (-90.1%)`	⬇️
pandas/core/tools/numeric.py	`10.44% <0%> (-89.56%)`	⬇️
... and 132 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dc86509...e010ffe.

codecov · 2019-04-20T20:54:21Z

Codecov Report

Merging #26162 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #26162      +/-   ##
==========================================
- Coverage   91.68%   91.67%   -0.02%     
==========================================
  Files         174      174              
  Lines       50703    50697       -6     
==========================================
- Hits        46488    46477      -11     
- Misses       4215     4220       +5

Flag	Coverage Δ
#multiple	`90.18% <ø> (-0.01%)`	⬇️
#single	`41.18% <ø> (-0.17%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/groupby.py	`97.23% <ø> (-0.01%)`	⬇️
pandas/core/groupby/generic.py	`88.93% <ø> (-0.08%)`	⬇️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`97.01% <0%> (-0.12%)`	⬇️
pandas/util/testing.py	`90.6% <0%> (-0.11%)`	⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a6e43a4...01057a4.

adbull · 2019-04-22T15:56:07Z

@jreback Done

adbull · 2019-04-23T14:26:18Z

Failing tests seem unrelated to this PR?

WillAyd · 2019-04-23T14:30:42Z

Try merging master; we just fixed a few things in CI.

WillAyd · 2019-04-24T17:10:05Z

pandas/core/groupby/generic.py

        output = OrderedDict(
-            (grp.name, grp.grouper) for grp in self.grouper.groupings)
+            (grp.name, grp.grouper) for grp in self.grouper.groupings
+            if grp.in_axis and grp.name in self._selected_obj)


Are these conditions ever different? I thought the point of in_axis was to state whether the grouping was a column in the selected object.

So in df.groupby(labels)[selection].ffill(), the first test checks whether labels is a column in df, or something else (e.g. a separate series). The second test then checks whether labels is in selection.

You need both tests, as e.g. labels could be in df but not selection; or could not be in df, but have the same name as a column in selection.
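To make the two checks concrete, here is a minimal sketch in plain Python. The `Grouping` tuple and the `labels_to_add_back` helper are hypothetical stand-ins made up for illustration, not pandas internals; only the two conditions (`in_axis` and "name is in the selection") come from the diff above.

```python
from collections import namedtuple

# Hypothetical stand-in for a pandas grouping object: it models only the
# two attributes that the condition in the diff looks at.
Grouping = namedtuple("Grouping", ["name", "in_axis"])


def labels_to_add_back(groupings, selected_columns):
    """Return the names of groupings that pass both checks from the diff."""
    return [
        grp.name
        for grp in groupings
        if grp.in_axis and grp.name in selected_columns
    ]


# Label is a column of df but not part of the selection:
# the first check passes, the second fails.
case_in_df_not_selected = labels_to_add_back([Grouping("key", True)], ["val"])

# Label is a separate Series that shares its name with a selected column:
# the second check passes, the first fails.
case_external_same_name = labels_to_add_back([Grouping("val", False)], ["val"])

# Label is a column of df and is also in the selection: both checks pass.
case_both = labels_to_add_back([Grouping("key", True)], ["key", "val"])

print(case_in_df_not_selected, case_external_same_name, case_both)
```

Each of the first two cases trips exactly one of the conditions, which is why neither check can be dropped on its own.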

Hmm OK, thanks. Sorry for all of the questions here, but the main thing I am trying to avoid is introducing new inconsistencies. I'm wondering why we don't see the same problem using shift as we do using _fill, as they both ultimately dispatch to _get_cythonized_result.

We don't have an issue with shift because it doesn't try to add the group labels back:

>>> df = pd.DataFrame(dict(x=[1]))
>>> df.groupby('x').shift()
Empty DataFrame
Columns: []
Index: [0]
>>> df.groupby('x').ffill()
   x
0  1

You could argue there's some inconsistency between shift and _fill here, but it's an inconsistency that's been in pandas for a long time. If we did want to change that, would probably be a separate PR?

Yea that's valid. The only thing I'm trying to avoid is changing the output format once here and then changing it again in a subsequent release - it just makes for a lesser end user experience.

Do we need to add the group labels back for fill?

Good question -- none of the other groupby transforms add labels back, so maybe we shouldn't? In general I'd prefer inconsistency to breaking changes, but we'll be breaking with 0.24 either way. Up to you?

Totally agreed - going to be breaking in any case but would rather break once in the next major release than have to break in the next major release then break again in the subsequent one.

Can you see what breaks if you try not to add the labels back in? May be indicative of effort involved which could guide the right path forward

If you leave the labels off, it just breaks the tests you'd expect: test_group_fill_methods, test_pad_stable_sorting and test_pct_change in tests/groupby/test_transform.py.

Right, those tests might have the wrong expectation.

Just so we are clear, here is the current behavior with this PR:

>>> df = pd.DataFrame([['a', 1], ['b', 2], ['a', np.nan], ['b', np.nan]], columns=['key', 'val'])
>>> df
  key  val
0   a  1.0
1   b  2.0
2   a  NaN
3   b  NaN
>>> df.groupby('key').ffill()
  key  val
0   a  1.0
1   b  2.0
2   a  1.0
3   b  2.0
>>> df.groupby('key').shift()  # or rank, cumsum, etc...
   val
0  NaN
1  NaN
2  1.0
3  2.0

If you get rid of the subclassed _fill implementation you'd get the following:

>>> df.groupby('key').ffill()
   val
0  1.0
1  2.0
2  1.0
3  2.0
>>> df.groupby('key').shift()  # or rank, cumsum, etc...
   val
0  NaN
1  NaN
2  1.0
3  2.0

I don't think the fill methods should be an exception in the shape of the returned value compared to the other transformation functions, which is why I'd rather you just remove _fill and update the tests instead of keeping the patch as is.

adbull · 2019-04-27T09:19:08Z

@WillAyd As discussed, I've changed this to not return group labels at all, for consistency with other groupby transforms.
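For readers who relied on the old output, the following is a rough sketch of the resulting behavior and of one way to get the key column back. It assumes pandas 0.25 or later (the release this change is milestoned for); the frame is the one from the review thread, and the `with_key` name is just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [["a", 1], ["b", 2], ["a", np.nan], ["b", np.nan]],
    columns=["key", "val"],
)

grouped = df.groupby("key")

# Assuming pandas >= 0.25: ffill returns only the non-key columns,
# matching shift, rank, cumsum and the other groupby transforms.
filled = grouped.ffill()
shifted = grouped.shift()

# The result keeps the original row index, so the key column can be
# re-attached explicitly with an index-aligned join when it is needed.
with_key = df[["key"]].join(filled)

print(filled)
print(with_key)
```

Joining on the index works here because the transform preserves the original row labels; on versions before 0.25 the join would fail on the overlapping `key` column.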

WillAyd · 2019-04-27T14:12:42Z

@adbull nice - looks good on initial glance. I'll dive deeper tomorrow.

Thanks for sticking with the feedback

WillAyd

lgtm outside of @jreback comments

jreback · 2019-05-07T01:57:28Z

ok looks good. can you merge master? also have a look at the example in the docs: http://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html#transformation; it's the note box

Some functions will automatically transform the input when applied to a GroupBy object, but returning an object of the same shape as the original. Passing as_index=False will not affect these transformation methods.
For example: fillna, ffill, bfill, shift..

This example is showing the incorrect behavior (I believe your patch will fix it); do your tests sufficiently cover this case?
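A quick way to check the case described in that note box might look like the sketch below. It assumes pandas 0.25 or later and a small made-up frame; it only compares the `as_index=False` result against the default one and makes no claim about what the actual test in this PR looks like.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (not taken from the docs example).
df = pd.DataFrame(
    {"key": ["a", "b", "a", "b"], "val": [1.0, 2.0, np.nan, np.nan]}
)

# Per the docs note, as_index=False is not expected to affect
# transformation methods such as ffill.
default = df.groupby("key").ffill()
no_index = df.groupby("key", as_index=False).ffill()

# Both results should be row-aligned with the original frame.
same_rows = default.index.equals(df.index) and no_index.index.equals(df.index)

print(default)
print(no_index)
print(same_rows)
```

If the two printed frames differ in their columns on a given version, that would point at the inconsistency the note box was showing.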

jreback · 2019-05-12T21:03:52Z

@adbull if you can merge master and update

adbull · 2019-05-12T21:12:10Z

@jreback Done. Patch will fix the example in the docs, and the test covers that case.

jreback · 2019-05-12T21:22:23Z

lgtm. ping on green.

jreback · 2019-05-12T22:04:49Z

you have a linting issue

adbull · 2019-05-13T07:52:18Z

The issue is from a different PR; I didn't edit test_groupby.py.

adbull · 2019-05-13T20:20:30Z

@jreback looks green

WillAyd · 2019-05-15T01:41:52Z

Thanks @adbull - nice change!

adbull force-pushed the groupby-ffill branch from f31ff65 to e010ffe Compare April 20, 2019 20:54

adbull force-pushed the groupby-ffill branch from e010ffe to a6b638c Compare April 20, 2019 21:06

jreback requested changes Apr 21, 2019

pandas/core/groupby/groupby.py (outdated, resolved)

pandas/core/groupby/generic.py (outdated, resolved)

jreback added the Missing-data label (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) Apr 21, 2019

adbull force-pushed the groupby-ffill branch from a6b638c to d969168 Compare April 22, 2019 15:53

adbull force-pushed the groupby-ffill branch from d969168 to 0252697 Compare April 23, 2019 15:13

WillAyd reviewed Apr 24, 2019

pandas/tests/groupby/test_transform.py (outdated, resolved)

pandas/core/groupby/groupby.py (outdated, resolved)

WillAyd added the Groupby label Apr 24, 2019

WillAyd reviewed Apr 24, 2019

adbull force-pushed the groupby-ffill branch 3 times, most recently from 56b75ae to dd793c5 Compare April 27, 2019 08:45

jreback requested changes Apr 28, 2019

doc/source/whatsnew/v0.25.0.rst (outdated, resolved)

doc/source/whatsnew/v0.25.0.rst (outdated, resolved)

pandas/tests/groupby/test_transform.py (resolved)

pandas/tests/groupby/test_transform.py (outdated, resolved)

jreback requested a review from jorisvandenbossche April 28, 2019 16:01

adbull force-pushed the groupby-ffill branch from dd793c5 to d3972c4 Compare April 28, 2019 16:33

WillAyd approved these changes Apr 29, 2019

adbull force-pushed the groupby-ffill branch from d3972c4 to 3514b45 Compare May 12, 2019 21:09

jreback added this to the 0.25.0 milestone May 12, 2019

jreback approved these changes May 12, 2019

API: groupby ffill adds labels as extra column (pandas-dev#21521)

01057a4

adbull force-pushed the groupby-ffill branch from 3514b45 to 01057a4 Compare May 13, 2019 18:52

WillAyd merged commit 3b24fb6 into pandas-dev:master May 15, 2019

adbull deleted the groupby-ffill branch May 15, 2019 19:02
