-
-
Notifications
You must be signed in to change notification settings - Fork 19.3k
BUG: groupby ffill adds labels as extra column (#21521) #26162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #26162 +/- ##
===========================================
- Coverage 91.98% 40.75% -51.24%
===========================================
Files 175 175
Lines 52377 52382 +5
===========================================
- Hits 48180 21346 -26834
- Misses 4197 31036 +26839
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #26162 +/- ##
==========================================
- Coverage 91.68% 91.67% -0.02%
==========================================
Files 174 174
Lines 50703 50697 -6
==========================================
- Hits 46488 46477 -11
- Misses 4215 4220 +5
Continue to review full report at Codecov.
|
|
@jreback Done |
|
Failing tests seem unrelated to this PR? |
|
Try merging master just fixed a few things in CI
…Sent from my iPhone
On Apr 23, 2019, at 7:26 AM, Adam Bull ***@***.***> wrote:
Failing tests seem unrelated to this PR?
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub, or mute the thread.
|
pandas/core/groupby/generic.py
Outdated
| output = OrderedDict( | ||
| (grp.name, grp.grouper) for grp in self.grouper.groupings) | ||
| (grp.name, grp.grouper) for grp in self.grouper.groupings | ||
| if grp.in_axis and grp.name in self._selected_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these conditions ever different? Thought the point of in_axis was to state if the grouping was a column in the selected object
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in df.groupby(labels)[selection].ffill(), the first test checks whether labels is a column in df, or something else (e.g. a separate series). The second test then checks whether labels is in selection.
You need both tests, as e.g. labels could be in df but not selection; or could not be in df, but have the same name as a column in selection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK thanks. Sorry for all of the questions here but main thing I am trying to avoid is introducing new inconsistencies. Wondering why we don't see the same problem using shift as we do using _fill as they ultimately dispatch to _get_cythonized_result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have an issue with shift because it doesn't try to add the group labels back:
>>> df = pd.DataFrame(dict(x=[1]))
>>> df.groupby('x').shift()
Empty DataFrame
Columns: []
Index: [0]
>>> df.groupby('x').ffill()
x
0 1You could argue there's some inconsistency between shift and _fill here, but it's an inconsistency that's been in pandas for a long time. If we did want to change that, would probably be a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could argue there's some inconsistency between
shiftand_fillhere, but it's an inconsistency that's been in pandas for a long time. If we did want to change that, would probably be a separate PR?
Yea that's valid. The only thing I'm trying to avoid this is changing the output format once here and changing again in a subsequent release - it just makes for a lesser end user experience.
Do we need to add the group labels back for fill?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question -- none of the other groupby transforms add labels back, so maybe we shouldn't? In general I'd prefer inconsistency to breaking changes, but we'll be breaking with 0.24 either way. Up to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally agreed - going to be breaking in any case but would rather break once in the next major release than have to break in the next major release then break again in the subsequent one.
Can you see what breaks if you try not to add the labels back in? May be indicative of effort involved which could guide the right path forward
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you leave the labels off, it just breaks the tests you'd expect: test_group_fill_methods, test_pad_stable_sorting and test_pct_change in tests/groupby/test_transform.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right those tests might have the wrong expectation.
Just so we are clear here is the current behavior with this PR:
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['a', np.nan], ['b', np.nan]], columns=['key', 'val'])
>>> df
key val
0 a 1.0
1 b 2.0
2 a NaN
3 b NaN
>>> df.groupby('key').ffill()
key val
0 a 1.0
1 b 2.0
2 a 1.0
3 b 2.0
>>> df.groupby('key').shift() # or rank, cumsum, etc...
val
0 NaN
1 NaN
2 1.0
3 2.0If you get rid of the subclassed _fill implementation you'd get the following:
>>> df.groupby('key').ffill()
val
0 1.0
1 2.0
2 1.0
3 2.0
>>> df.groupby('key').shift() # or rank, cumsum, etc...
val
0 NaN
1 NaN
2 1.0
3 2.0I don't think the fill methods should make an exception on the shape of the returned value when compared to any other transformation functions, hence why I'd rather you just remove _fill and update tests instead of the patch as is
56b75ae to
dd793c5
Compare
|
@WillAyd As discussed, I've changed this to not return group labels at all, for consistency with other groupby transforms. |
|
@adbull nice - looks good on initial glance. I'll dive deeper tomorrow. Thanks for sticking with the feedback |
WillAyd
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm outside of @jreback comments
|
ok looks good. can you merge master. also have a look at the example in the docs: http://pandas-docs.github.io/pandas-docs-travis/user_guide/groupby.html#transformation; its the note box This example is showing the incorrect behavior (I believe your patch will fix it); do your test sufficiently cover this case? |
|
@adbull if you can merge master and update |
|
@jreback Done. Patch will fix the example in the docs, and the test covers that case. |
|
lgtm. ping on green. |
|
you have a linting issue |
|
Issue is from a different PR, I didn't edit test_groupby.py |
|
@jreback looks green |
|
Thanks @adbull - nice change! |
git diff upstream/master -u -- "*.py" | flake8 --diff