PERF: Cythonize Groupby Rank #19481

Merged
merged 35 commits into from
Feb 10, 2018
396f1b6
Initial working rank with no tiebreaker
WillAyd Jan 29, 2018
c2c2177
Allowed kwargs to pass through to Cython func
WillAyd Jan 31, 2018
529503f
Comprehensive tests for all groupby rank args
WillAyd Jan 31, 2018
c7faa3b
Working avg tiebreak with nan handling
WillAyd Feb 1, 2018
baeb192
Added remaining tiebreakers; fixed int/float dtype mixup
WillAyd Feb 1, 2018
07c8e0f
Added func for obj support
WillAyd Feb 1, 2018
2ba6643
Added pct support
WillAyd Feb 1, 2018
4e54aa5
Added support for sorting
WillAyd Feb 1, 2018
428d32c
Working tests (excl missing data)
WillAyd Feb 1, 2018
902ef3c
Added Timestamps to tests
WillAyd Feb 1, 2018
ecd4b51
Working rank with numeric and missing
WillAyd Feb 5, 2018
e17433d
Added missing obj support
WillAyd Feb 5, 2018
b0ea557
Added support for timestamps mixed with nan
WillAyd Feb 5, 2018
e15b4b2
Added tests for multiple groups
WillAyd Feb 5, 2018
04eb4f1
Fixed bug with First tiebreak across multiple groups
WillAyd Feb 5, 2018
7a4602d
Variable Name Cleanup
WillAyd Feb 5, 2018
7be3bf3
Converted kwargs to positional arguments in Cython layer
WillAyd Feb 6, 2018
ca28350
Lint fixes
WillAyd Feb 6, 2018
913ce94
Created enum for rank tiebreakers
WillAyd Feb 6, 2018
4755941
Fixed build errors; Py <3.5 support
WillAyd Feb 6, 2018
d4a6662
LINT fixes
WillAyd Feb 6, 2018
56e7974
Fixed isnan reference issue on Windows
WillAyd Feb 7, 2018
9d7c3e6
Updated whatsnew
WillAyd Feb 7, 2018
178654d
Added GroupBy object raises tests
WillAyd Feb 7, 2018
f6ae88a
Raise ValueError in group_rank_object
WillAyd Feb 7, 2018
caacef2
Used anonymous func for rank wrapper
WillAyd Feb 7, 2018
a315a92
Removed group_rank_object
WillAyd Feb 8, 2018
a6ca485
Added comments to groupby_helper
WillAyd Feb 8, 2018
fd29d70
Added tests for rank bugs
WillAyd Feb 8, 2018
b9e4719
Fixed issue with ranks not resetting across groups
WillAyd Feb 8, 2018
613384c
Changed types; fixed tiebreaker float casting issue
WillAyd Feb 8, 2018
94a2749
Documentation cleanup
WillAyd Feb 8, 2018
3ee99c0
Removed unused import from groupby.pyx
WillAyd Feb 8, 2018
b430635
Removed npy_isnan import
WillAyd Feb 9, 2018
aa4578d
Added grp_sizes array, broke out pct calc
WillAyd Feb 9, 2018
Documentation cleanup
WillAyd committed Feb 9, 2018
commit 94a2749c373ec22bfa92c254612f98d44fef1adb
25 changes: 23 additions & 2 deletions pandas/_libs/groupby_helper.pxi.in
@@ -452,8 +452,29 @@ def group_rank_{{name}}(ndarray[float64_t, ndim=2] out,
ndarray[int64_t] labels,
bint is_datetimelike, object ties_method,
bint ascending, bint pct, object na_option):
"""
Only transforms on axis=0
"""Provides the rank of values within each group

Parameters
----------
out : array of float64_t values which this method will write its results to
values : array of {{c_type}} values to be ranked
labels : array containing unique label for each group, with its ordering
matching up to the corresponding record in `values`
is_datetimelike : bool
unused in this method but provided for call compatibility with other
Cython transformations
na_option : {'keep', 'top', 'bottom'}
* keep: leave NA values where they are
* top: smallest rank if ascending
* bottom: smallest rank if descending
ascending : boolean
False for ranks by high (1) to low (N)
pct : boolean
Compute percentage rank of data within each group

Notes
-----
This method modifies the `out` parameter rather than returning an object
"""
cdef:
TiebreakEnumType tiebreak
35 changes: 29 additions & 6 deletions pandas/core/groupby.py
@@ -1776,7 +1776,29 @@ def cumcount(self, ascending=True):
@Appender(_doc_template)
def rank(self, method='average', ascending=True, na_option='keep',
pct=False, axis=0):
"""Rank within each group"""
"""Provides the rank of values within each group

Parameters
----------
method : {'average', 'min', 'max', 'first', 'dense'}, default 'average'
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
na_option : {'keep', 'top', 'bottom'}, default 'keep'
* keep: leave NA values where they are
* top: smallest rank if ascending
* bottom: smallest rank if descending
ascending : boolean, default True
False for ranks by high (1) to low (N)
pct : boolean, default False
Compute percentage rank of data within each group

Returns
-------
DataFrame with ranking of values within each group
"""
return self._cython_transform('rank', numeric_only=False,
ties_method=method, ascending=ascending,
na_option=na_option, pct=pct, axis=axis)
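
To illustrate the tiebreak options listed in the docstring above, here is a minimal pure-Python sketch. It is not the Cython routine from this PR: the function name `rank_within_groups` is hypothetical, missing values are not handled, and dividing by the group size for `pct` is an assumption of the sketch.

```python
from collections import defaultdict


def rank_within_groups(values, labels, method="average",
                       ascending=True, pct=False):
    """Hypothetical reference implementation; ignores NA handling."""
    out = [0.0] * len(values)

    # Collect the positions belonging to each group label.
    positions = defaultdict(list)
    for i, label in enumerate(labels):
        positions[label].append(i)

    for idxs in positions.values():
        # sorted() is stable, so tied values keep their order of appearance.
        order = sorted(idxs, key=lambda i: values[i], reverse=not ascending)
        start = 0
        dense_rank = 0
        while start < len(order):
            # Find the end of the run of values tied with order[start].
            end = start
            while (end + 1 < len(order)
                   and values[order[end + 1]] == values[order[start]]):
                end += 1
            dense_rank += 1
            for offset, i in enumerate(order[start:end + 1]):
                if method == "average":
                    rank = (start + end + 2) / 2
                elif method == "min":
                    rank = start + 1
                elif method == "max":
                    rank = end + 1
                elif method == "first":
                    rank = start + 1 + offset
                elif method == "dense":
                    rank = dense_rank
                else:
                    raise ValueError("unknown method: %r" % (method,))
                out[i] = float(rank)
            start = end + 1
        if pct:
            # Assumption: percentage rank is rank divided by group size.
            for i in idxs:
                out[i] /= len(idxs)
    return out


values = [2, 2, 8, 5, 1]
labels = ["a", "a", "a", "b", "b"]
print(rank_within_groups(values, labels, method="average"))
print(rank_within_groups(values, labels, method="dense"))
```

Ranks restart at 1 for every label, which is the behaviour the "ranks not resetting across groups" fix earlier in this PR is concerned with.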
@@ -2198,11 +2220,12 @@ def get_group_levels(self):
'cummax': 'group_cummax',
'rank': {
'name': 'group_rank',
'f': lambda func, a, b, c, d, **kwargs: func(a, b, c, d,
kwargs.get('ties_method', 'average'),
kwargs.get('ascending', True),
kwargs.get('pct', False),
kwargs.get('na_option', 'keep')
'f': lambda func, a, b, c, d, **kwargs: func(
a, b, c, d,
kwargs.get('ties_method', 'average'),
kwargs.get('ascending', True),
kwargs.get('pct', False),
kwargs.get('na_option', 'keep')
)
}
}
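
The wrapper lambda above turns keyword arguments into the fixed positional order that the Cython function expects. A rough, self-contained sketch of that dispatch pattern follows; `group_rank` here is a hypothetical Python stand-in, and `_kernels`, `_transforms` and `call_transform` are illustrative names, not pandas internals.

```python
def group_rank(out, values, labels, is_datetimelike,
               ties_method, ascending, pct, na_option):
    # Hypothetical stand-in for the Cython kernel: it only records the
    # options it received so the argument order can be inspected.
    out.append((ties_method, ascending, pct, na_option))


# Lookup of kernels by name (illustrative only).
_kernels = {"group_rank": group_rank}

_transforms = {
    "rank": {
        "name": "group_rank",
        # Same shape as the lambda in the diff: pull keyword arguments out
        # of **kwargs, apply defaults, and pass them on positionally.
        "f": lambda func, a, b, c, d, **kwargs: func(
            a, b, c, d,
            kwargs.get("ties_method", "average"),
            kwargs.get("ascending", True),
            kwargs.get("pct", False),
            kwargs.get("na_option", "keep"),
        ),
    }
}


def call_transform(how, out, values, labels, is_datetimelike, **kwargs):
    entry = _transforms[how]
    func = _kernels[entry["name"]]
    return entry["f"](func, out, values, labels, is_datetimelike, **kwargs)


received = []
call_transform("rank", received, [1.0], [0], False, pct=True)
print(received)
```

Keeping the defaults inside the wrapper means callers can omit any option, while the kernel itself takes no keyword arguments.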