ENH: Support kurtosis (kurt) in DataFrameGroupBy and SeriesGroupBy #60433
Very nice! A few test requests (I think these are not covered yet):
- skipna with NA values in the data
- Float64 and float64[pyarrow] dtypes
- Constant data (e.g. [1, 1, 1, 1])
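The requested cases can be sketched roughly as follows. This is only an illustration, not the PR's test code: the column names and values are hypothetical, and it goes through Series.kurt on each group rather than the new groupby method, so it also runs on pandas versions that predate this change. The expected value for the constant group is deliberately not asserted here.

```python
import pandas as pd

# Hypothetical data: group "a" contains a missing value, group "b" is constant.
data = {
    "key": ["a"] * 6 + ["b"] * 4,
    "x": [1.0, 2.0, None, 3.0, 4.0, 5.0, 1.0, 1.0, 1.0, 1.0],
}

results = {}
for dtype in ["float64", "Float64"]:  # NumPy float and NumPy-nullable float
    df = pd.DataFrame(data).astype({"x": dtype})
    for skipna in [True, False]:
        # Group-by-group reference: call Series.kurt on each group's column.
        results[(dtype, skipna)] = {
            key: group["x"].kurt(skipna=skipna)
            for key, group in df.groupby("key")
        }

for case, per_group in results.items():
    print(case, per_group)
```

A float64[pyarrow] variant would follow the same pattern but needs pyarrow installed.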
Thanks for the review @rhshadrach.
Added test case for float64[pyarrow] (we already have one for float64)
The request was for Float64, the NumPy-nullable array. I think you can just parametrize your arrow test:
@pytest.mark.parametrize("dtype", [pytest.param("float64[pyarrow]", marks=td.skip_if_no("pyarrow")), "Float64"])
# GH#40139
# Test that the groupby kurt method (which uses libgroupby.group_kurt)
# matches the results of operating group-by-group (which uses nanops.nankurt)
nrows = 1000
I was concerned about runtime at first, but nrows of 10, 100, and 1000 all run in about the same time on my machine; the bottleneck appears to be O(1) overhead. I don't see O(n) behavior until 100_000.
arr = np.random.default_rng(2).standard_normal((nrows, ncols))
arr[np.random.default_rng(2).random(nrows) < nan_frac] = np.nan
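For context, the comparison this test makes can be sketched as below. This is a hedged illustration rather than the PR's code: ncols, ngroups, nan_frac and the column names are assumed values, a single Generator is reused for both the data and the NaN mask (a choice made here, not necessarily the project's pattern), and the groupby kurt call is guarded because it only exists on pandas versions that include this change.

```python
import numpy as np
import pandas as pd

# Assumed parameters for illustration only.
nrows, ncols, ngroups, nan_frac = 1000, 4, 5, 0.1

rng = np.random.default_rng(2)
arr = rng.standard_normal((nrows, ncols))
# A 1-D boolean mask of length nrows blanks out whole rows.
arr[rng.random(nrows) < nan_frac] = np.nan

df = pd.DataFrame(arr, columns=list("abcd"))
keys = rng.integers(0, ngroups, size=nrows)
gb = df.groupby(keys)

# Reference result: DataFrame.kurt evaluated group by group.
expected = pd.DataFrame({key: group.kurt() for key, group in gb}).T

# Only available where DataFrameGroupBy.kurt exists.
if hasattr(gb, "kurt"):
    result = gb.kurt()
    assert np.allclose(
        result.sort_index().to_numpy(),
        expected.sort_index().to_numpy(),
        equal_nan=True,
    )
```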
@mroeschke - I think you reworked the random data generation a while back; I want to make sure this agrees with those patterns.
DataFrameGroupBy and SeriesGroupBy currently support mean, std and skew (the first 3 moments) but not kurtosis (the 4th moment). This change addresses that. I implemented kurtosis in Cython in a similar fashion to skewness. I've verified that the output of this function matches that of DataFrame.kurt().
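As a reference point for what is being computed, here is a minimal pure-Python sketch of bias-corrected excess kurtosis (Fisher definition, so a normal distribution gives 0), which is my understanding of what DataFrame.kurt() reports. It is not the Cython implementation from this PR. The helper names are hypothetical, and returning 0.0 for zero-variance input is an assumption made for the sketch.

```python
import math


def kurt_reference(values, skipna=True):
    """Bias-corrected excess kurtosis of an iterable of floats (sketch)."""
    vals = [float(v) for v in values]
    if any(math.isnan(v) for v in vals):
        if not skipna:
            return math.nan
        vals = [v for v in vals if not math.isnan(v)]
    n = len(vals)
    if n < 4:
        # The correction terms divide by (n - 2) * (n - 3).
        return math.nan
    mean = sum(vals) / n
    m2 = sum((v - mean) ** 2 for v in vals)  # sum of squared deviations
    m4 = sum((v - mean) ** 4 for v in vals)  # sum of fourth-power deviations
    if m2 == 0:
        return 0.0  # assumption: constant data is treated as zero kurtosis
    adj = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return n * (n + 1) * (n - 1) * m4 / ((n - 2) * (n - 3) * m2**2) - adj


def group_kurt_reference(keys, values, skipna=True):
    """Apply kurt_reference to each group of values sharing a key."""
    groups = {}
    for key, value in zip(keys, values):
        groups.setdefault(key, []).append(value)
    return {key: kurt_reference(vals, skipna) for key, vals in groups.items()}


print(group_kurt_reference(["a"] * 5 + ["b"] * 3, [1, 2, 3, 4, 5, 1, 2, 3]))
```

Group "b" has fewer than four observations, so the sketch returns NaN for it.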