ENH: Support kurtosis (kurt) in DataFrameGroupBy and SeriesGroupBy #60433

Open: 8 commits to merge into base: main

@snitish (Contributor) commented Nov 27, 2024

DataFrameGroupBy and SeriesGroupBy currently support mean, std and skew (the first 3 moments) but not kurtosis (the 4th moment). This change adds it. I implemented kurtosis in Cython in a similar fashion to skewness, and I've verified that the output of this function matches that of DataFrame.kurt().
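For readers who want to check the "matches DataFrame.kurt()" claim by hand, here is a minimal sketch of the bias-corrected excess kurtosis, compared per group against Series.kurt(). It is not the Cython code from this PR; the helper name `sample_excess_kurtosis` and the example data are made up for illustration, and the formula is the standard sample estimator that Series.kurt() is understood to use.

```python
import numpy as np
import pandas as pd


def sample_excess_kurtosis(values):
    """Bias-corrected excess kurtosis of one group (hypothetical helper)."""
    x = np.asarray(values, dtype="float64")
    x = x[~np.isnan(x)]  # mimic skipna=True
    n = x.size
    if n < 4:
        return np.nan  # the estimator is undefined for fewer than 4 points
    dev = x - x.mean()
    m2 = np.sum(dev**2)  # sum of squared deviations
    m4 = np.sum(dev**4)  # sum of fourth-power deviations
    if m2 == 0:
        return 0.0  # constant data: follow the 0.0 convention
    numerator = n * (n + 1) * (n - 1) * m4
    denominator = (n - 2) * (n - 3) * m2**2
    adjustment = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return numerator / denominator - adjustment


# Illustrative data only
df = pd.DataFrame(
    {
        "key": list("aaaaabbbbb"),
        "val": [1.0, 2.0, 4.0, 7.0, 11.0, 3.0, 3.5, 2.0, 8.0, 5.0],
    }
)

by_formula = df.groupby("key")["val"].apply(sample_excess_kurtosis)
by_pandas = df.groupby("key")["val"].apply(lambda s: s.kurt())
print(pd.DataFrame({"formula": by_formula, "pandas": by_pandas}))
```

The two columns should agree up to floating-point error.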

@rhshadrach (Member) left a comment

Very nice! A few test requests (I think these are not covered yet):

  • skipna with NA values in the data
  • Float64 and float64[pyarrow] dtypes
  • Constant data (e.g. [1, 1, 1, 1])

@snitish (Contributor, Author) commented Dec 4, 2024

Thanks for the review @rhshadrach.

  • Addressed your comments
  • Added test case for skipna=False (the default is True)
  • Added test case for float64[pyarrow] (we already have one for float64)
  • Added test case for constant data. Note that the result here is 0.0, consistent with DataFrame.kurt() and Series.kurt()
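The behaviours mentioned in this list can be seen on the existing Series.kurt(), which the new groupby method is meant to match. This is only a sketch with made-up data, not the test code added in the PR.

```python
import numpy as np
import pandas as pd

# Illustrative data only
constant = pd.Series([1.0, 1.0, 1.0, 1.0])
with_nan = pd.Series([1.0, 2.0, np.nan, 4.0, 7.0, 11.0])

# Constant data: reported as 0.0 rather than NaN
constant_kurt = constant.kurt()

# skipna=True (the default) drops the missing value before reducing
kurt_skipping = with_nan.kurt()

# skipna=False lets the missing value propagate into the result
kurt_propagating = with_nan.kurt(skipna=False)

print(constant_kurt, kurt_skipping, kurt_propagating)
```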

@snitish snitish requested a review from rhshadrach December 4, 2024 03:05
@rhshadrach (Member) left a comment

Added test case for float64[pyarrow] (we already have one for float64)

The request was for Float64, the NumPy-nullable array. Can just parameterize your arrow test I think:

@pytest.mark.parametrize("dtype", [pytest.param("float64[pyarrow]", marks=td.skip_if_no("pyarrow")), "Float64"])
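To illustrate the distinction being drawn here, a small sketch: "Float64" (capital F) is the nullable masked dtype that stores missing values as pd.NA, while "float64" is the plain NumPy dtype that stores them as NaN. The data is made up, and the pyarrow-backed dtype is left out so the snippet runs without pyarrow installed.

```python
import pandas as pd

data = [1.0, 2.0, None, 4.0, 7.0, 11.0]  # illustrative values only

nullable = pd.Series(data, dtype="Float64")      # masked array, missing value is pd.NA
numpy_backed = pd.Series(data, dtype="float64")  # NumPy array, missing value is NaN

# Both reductions skip the missing entry by default
nullable_kurt = nullable.kurt()
numpy_kurt = numpy_backed.kurt()

print(nullable.dtype, nullable_kurt)
print(numpy_backed.dtype, numpy_kurt)
```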

# GH#40139
# Test that the groupby kurt method (which uses libgroupby.group_kurt)
# matches the results of operating group-by-group (which uses nanops.nankurt)
nrows = 1000
Member


I was first concerned about runtime, but nrows of 10, 100, and 1000 all run in about the same time on my machine; the bottleneck appears to be O(1) overhead. I don't see O(n) behavior until 100_000.

Comment on lines +18 to +19
arr = np.random.default_rng(2).standard_normal((nrows, ncols))
arr[np.random.default_rng(2).random(nrows) < nan_frac] = np.nan
Member


@mroeschke - I think you reworked the random data generation a while back; I want to make sure this agrees with those patterns.
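For context, a sketch of the kind of equivalence check the quoted lines set up, written with a single generator reused for the data, the NaN mask and the group labels. This is one possible pattern, not necessarily the one the test suite prefers. The values of `ncols`, `ngroups` and `nan_frac` are assumptions, and the `hasattr` guard is only there so the snippet also runs on pandas builds that do not yet have the grouped kurt method.

```python
import numpy as np
import pandas as pd

# nrows is taken from the quoted test; the other values are assumed
nrows, ncols, ngroups, nan_frac = 1000, 3, 4, 0.1

rng = np.random.default_rng(2)  # one generator for everything
arr = rng.standard_normal((nrows, ncols))
arr[rng.random(nrows) < nan_frac] = np.nan  # blanks whole rows, as in the quoted lines
df = pd.DataFrame(arr, columns=list("abc"))
labels = rng.integers(0, ngroups, size=nrows)

gb = df.groupby(labels)

# Reference result: operate group-by-group through DataFrame.kurt
expected = pd.concat({key: grp.kurt() for key, grp in gb}, axis=1).T

# Use the grouped reduction when available, otherwise fall back to apply
if hasattr(gb, "kurt"):
    result = gb.kurt()
else:
    result = gb.apply(lambda g: g.kurt())

print(np.allclose(result.to_numpy(), expected.to_numpy()))
```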

Labels: Groupby, Reduction Operations (sum, mean, min, max, etc.)

Successfully merging this pull request may close these issues.

ENH:AttributeError: 'SeriesGroupBy' object has no attribute 'kurtosis'
3 participants