Skip to content

repr string for pd.Grouper #17727

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Conversation

topper-123
Copy link
Contributor

@topper-123 topper-123 commented Sep 30, 2017

  • closes #xxxx
  • [x ] tests added / passed
  • [ x] passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Edit: I've made a full proposal (sans some discussion points below)

Currently the repr for Grouper and TimeGrouper is not pretty:

>>> pd.Grouper(key='key')
<pandas.core.groupby.Grouper at 0x248d5ebfd30>
>>> pd.Grouper(key='key', freq='50Min')
<pandas.core.resample.TimeGrouper at 0x248d68d95c0>

I propose adding a Grouper.__repr__, so the repr will be like:

>>> pd.Grouper(key='key')
Grouper(key='key')
>>> pd.Grouper(key='key', freq='50Min')
TimeGrouper(key='key', freq='50T')
>>> pd.Grouper(key='key', freq='50Min', label='right')
TimeGrouper(label='right', key='k', freq='50T')

The repr shows the instantiation form, so users could copy the repr and paste it in to use it, which is always nice.

See attached PR. Tests are still missing. Comments welcome.

Two points:

  • The repr calculation is a bit heavy, so I've cached it. Don't know if that is going overboard?
  • Is TimeGrouper deprecated?

@codecov
Copy link

codecov bot commented Sep 30, 2017

Codecov Report

Merging #17727 into master will decrease coverage by 0.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17727      +/-   ##
==========================================
- Coverage   91.25%   91.24%   -0.02%     
==========================================
  Files         163      163              
  Lines       50130    49876     -254     
==========================================
- Hits        45748    45509     -239     
+ Misses       4382     4367      -15
Flag Coverage Δ
#multiple 89.04% <100%> (-0.01%) ⬇️
#single 40.27% <75%> (-0.12%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.04% <100%> (ø) ⬆️
pandas/core/resample.py 96.18% <100%> (+0.02%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/compat/pickle_compat.py 69.51% <0%> (-6.1%) ⬇️
pandas/core/indexes/range.py 92.83% <0%> (-2.83%) ⬇️
pandas/util/_decorators.py 78% <0%> (-2.71%) ⬇️
pandas/util/_validators.py 93.75% <0%> (-2.6%) ⬇️
pandas/io/html.py 84.85% <0%> (-1.13%) ⬇️
pandas/core/dtypes/concat.py 98.26% <0%> (-0.89%) ⬇️
pandas/io/common.py 68.64% <0%> (-0.85%) ⬇️
... and 52 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c176a3c...62f2de8. Read the comment docs.

@topper-123 topper-123 changed the title WIP: repr string for pd.Grouper repr string for pd.Grouper Sep 30, 2017
@gfyoung
Copy link
Member

gfyoung commented Oct 1, 2017

@topper-123 : The __repr__ looks reasonable to me, though in the future, would have been good to create an issue first to get consensus before proceeding to a PR.

@topper-123
Copy link
Contributor Author

Yeah, I just started doing this for a personal use case and thought it could be nice to have in pandas itself. so I maybe took two steps when only one was all that was needed. I felt silly to post a issue when the idea already was coded... My main concern was if you dislike caching the repr in pandas and/or the if the approach should be less complex.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2017

why are you doing this?
if it actually is expensive to compute the repr we have the cache_readyonly decorator for that

this just makes the code confusing

@jreback
Copy link
Contributor

jreback commented Oct 1, 2017

Timegrouper is deprecated

@gfyoung
Copy link
Member

gfyoung commented Oct 1, 2017

I felt silly to post a issue when the idea already was coded

@topper-123 : The point of opening an issue is to get a consensus on whether we actually think it's a good idea to do this in the first place.

@jreback
Copy link
Contributor

jreback commented Oct 1, 2017

@topper-123 note that I DO like the idea of a nice repr for Grouper, just don't think the caching is needed.

@pep8speaks
Copy link

pep8speaks commented Oct 1, 2017

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on November 06, 2017 at 22:21 Hours UTC

@@ -333,6 +333,16 @@ def _set_grouper(self, obj, sort=False):
def groups(self):
return self.grouper.groups

_init_defaults = dict(key=None, level=None, freq=None, axis=0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in other context we call this _attributes (I think we do this kind of signature generation already in resample). pls try to find examples and make consistent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, changed

@topper-123
Copy link
Contributor Author

topper-123 commented Oct 1, 2017

I've changed the code:

  • no more caching, always show current values
  • the init default parameters are in a class attribute; improves subclassability (user can change them upon subclassing)
  • default closed and label for TimeGrouper are calculated at call time, so they will also always be correct now
    ènd_types have one location, which also is better.
  • TimeGrouper still got a repr string: It looks to me that the top level location has been deprecated, but it will still be called through Grouper.__new__?

@topper-123 topper-123 force-pushed the Grouper_repr branch 3 times, most recently from 021921b to c52b439 Compare October 2, 2017 08:42
@topper-123
Copy link
Contributor Author

topper-123 commented Oct 2, 2017

Tests added. I've finished up, unless more comments .

EDIT: Moved the test for TimeGrouper to test_resample.py.

@topper-123 topper-123 force-pushed the Grouper_repr branch 2 times, most recently from e36d2c3 to da051ce Compare October 2, 2017 12:37
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some issues

@@ -333,6 +333,17 @@ def _set_grouper(self, obj, sort=False):
def groups(self):
return self.grouper.groups

_attributes = dict(key=None, level=None, freq=None, axis=0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move this to the top of the class (_attributes)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be an OrderedDict. Though I wouldn't actually make this a dict at all, rather a list and just introspect the current values (which picks up the defaults, no need to have them in multiple places)

Copy link
Contributor Author

@topper-123 topper-123 Oct 3, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Used OrderDict.

The TimeGrouper in particular is even uglier now, but there are quite a bit of special cases there, so cant see a better way.

@@ -1286,6 +1286,29 @@ def _get_period_bins(self, ax):

return binner, bins, labels

# _attributes is used in __repr__below
_attributes = Grouper._attributes.copy()
_attributes.update(freq='Min', how='mean', nperiods=None, axis=0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just reset them to what they should be

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are overwriting an already set _attributes value (see the top of the class)

@@ -1028,7 +1028,7 @@ def __init__(self, freq='Min', closed=None, label=None, how='mean',
convention=None, base=0, **kwargs):
freq = to_offset(freq)

end_types = set(['M', 'A', 'Q', 'BM', 'BA', 'BQ', 'W'])
self._end_types = end_types = {'M', 'A', 'Q', 'BM', 'BA', 'BQ', 'W'}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

huh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, can see this is confusing. Moved it to class attribute.

@jreback jreback added the Output-Formatting __repr__ of pandas objects, to_string label Oct 2, 2017
@topper-123 topper-123 force-pushed the Grouper_repr branch 2 times, most recently from 93a2d42 to 236b12d Compare October 2, 2017 23:58
@@ -233,6 +233,10 @@ class Grouper(object):

>>> df.groupby(Grouper(level='date', freq='60s', axis=1))
"""
_attributes = collections.OrderedDict((('key', None), ('level', None),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as i said before these should just be a list
and _attributes is already defined IIRC you are overwriting it i think

@topper-123
Copy link
Contributor Author

I've redone this.

It's almost impossible to set _attributes as a list before __init__ for TimeGrouper because there's quite a lot of logic in TimeGrouper.__init__ for what should count as a default value.

By setting _attributes right after __init__ and setting it to an OrderedDict, the implementation works relatively simple, also for TimeGrouper.

@jreback
Copy link
Contributor

jreback commented Oct 3, 2017

@topper-123 again, not sure why you want to repeat code for the default values. you simply need to introspect them, NOT redefine the default.

@@ -252,6 +252,11 @@ def __init__(self, key=None, level=None, freq=None, axis=0, sort=False):
self.indexer = None
self.binner = None

# _attributes is used in __repr__below
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is just not needed at all

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_attributes should be a list of the actual attributes, NOT attached to the default values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is already done for the resampler to produce the signature, just follow the example.

@topper-123
Copy link
Contributor Author

The travis error seems to be unrelated to my PR.

I've made some changes that I think correspond to your comments. Note that this makes TimeGrouper more verbose:

>>> Grouper(key='key', freq='50Min', label='right')
TimeGrouper(key='key', freq=<50 * Minutes>, axis=0, sort=True, closed='left', label='right', how='mean', loffset=None)

I've made a version where only input is displayed, but is that code quite complex, because there are so many special cases in TimeGrouper.__init__.

Is this ok?

@@ -333,6 +335,17 @@ def _set_grouper(self, obj, sort=False):
def groups(self):
return self.grouper.groups

def __repr__(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why you need to introspect at all. when repr is called all of the values are set, simply iterate thru them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to get the difference between the default values and the actual value, and only use attrubutes that deviate from the default values.

@jreback
Copy link
Contributor

jreback commented Nov 9, 2017

superseded by #18203

@jreback jreback closed this Nov 9, 2017
@topper-123 topper-123 deleted the Grouper_repr branch December 11, 2017 08:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Groupby Output-Formatting __repr__ of pandas objects, to_string
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants