
API: Should apply be smart? #39209

Open
@rhshadrach

From #34998 (comment)

..., this got me thinking a bit about the general issue we are tackling here: should apply() be smart and try to infer what kind of function is being applied? Or should it not try to be smart and always do the same thing (e.g. always prepend the group keys by default)?

For the specific case where the applied function is a transformation (preserving an identical index), the consequence of the changes we decided to make in this PR is that we want apply to no longer be smart.

But currently (and this is long-time behaviour), apply tries to be smart and this is also documented. In the "flexible apply" section in the docs, there is a note that says:

apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it.
So depending on the path taken, and exactly what you are grouping. Thus the grouped column(s) may be included in
the output as well as set the indices.

(So if we go forward as is with this PR, that note should actually be updated.)

The consequence of this PR is that apply will no longer try to detect if it acted as a transformer or not (at least by default with group_keys=True, as group_keys=False still gives this reindexing behaviour for a transformer as discussed above).

But if we no longer detect the transformer case, should we then also stop detecting the reducer case?
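For reference, the two cases the quoted comment contrasts can be reproduced with a small example. This is only a sketch: the frame and the column names `key` and `val` are made up for illustration, and the result of the transformer case under the default `group_keys=True` is deliberately not shown because it depends on the pandas version.

```python
import pandas as pd

# Hypothetical toy data, not taken from the issue.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 10]})

# UDF returns a scalar per group: apply behaves like a reducer,
# and the result is indexed by the group keys.
as_reducer = df.groupby("key")["val"].apply(lambda s: s.sum())

# UDF returns a like-indexed Series per group: with group_keys=False
# the result keeps the original row index, like a transformer.
as_transformer = df.groupby("key", group_keys=False)["val"].apply(
    lambda s: s - s.mean()
)

print(as_reducer)
print(as_transformer)
```

The same method call yields two differently shaped results purely because of what the UDF returned, which is the "smart" inference under discussion.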

IMO, groupby.apply should not be a one-stop-shop for all your UDF needs. If users want to use a reducer, use agg; if they want to use a transformer, use transform; if they want to use a filter, use filter. The role of apply, to me, should be for UDFs that don't fit into these roles. In this sense, I think apply should not go out of its way to generate more convenient/sensible output for these classes of UDFs.
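As an illustration of that division of labour, here is a minimal sketch using the three dedicated methods. The data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not taken from the issue.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 10]})
grouped = df.groupby("key")["val"]

# Reducer: one value per group, indexed by the group keys.
reduced = grouped.agg(lambda s: s.sum())

# Transformer: one value per input row, original index preserved.
transformed = grouped.transform(lambda s: s - s.mean())

# Filter: whole groups are kept or dropped; surviving rows keep their index.
filtered = df.groupby("key").filter(lambda g: len(g) > 1)

print(reduced)
print(transformed)
print(filtered)
```

Each method fixes the output shape up front, so nothing has to be inferred from what the UDF returns.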

The result of a groupby is a mapping from each group to a Python object. Whatever behavior is decided for apply, it should be well-defined for every possible result, with no more assumptions than that. Here is a naive attempt at doing so.

  • If all constituent results are Series/DataFrames, concatenate results together.
  • Otherwise, the result is a Series where each value is what was returned from the UDF; dtype is inferred.
  • A keyword-only argument on apply (not groupby) controls whether the groups are added to the index, the columns, or neither.
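The rules above can be sketched in a few lines. This is a rough illustration, not a proposed implementation: `naive_apply` and its `keys_in_index` parameter are hypothetical names, and the "columns" option is left out because its exact meaning is not spelled out here.

```python
import pandas as pd


def naive_apply(groupby, func, *, keys_in_index=True):
    """Hypothetical helper sketching the proposed rules for apply."""
    # A groupby result is just a mapping: group key -> Python object.
    results = {key: func(group) for key, group in groupby}

    # Rule 1: if every result is a Series/DataFrame, concatenate them.
    if results and all(
        isinstance(r, (pd.Series, pd.DataFrame)) for r in results.values()
    ):
        if keys_in_index:
            # Passing the dict makes the group keys the outer index level.
            return pd.concat(results)
        return pd.concat(list(results.values()))

    # Rule 2: otherwise, one value per group; dtype is inferred.
    return pd.Series(results)


# Hypothetical toy data, not taken from the issue.
df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 10]})
gb = df.groupby("key")["val"]

scalars = naive_apply(gb, lambda s: s.sum())
stacked = naive_apply(gb, lambda s: s * 2)
flat = naive_apply(gb, lambda s: s * 2, keys_in_index=False)
```

Note that the helper never checks whether a result is indexed like its input, so a transformer-like UDF gets no special treatment.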

Perhaps NumPy-specific support should be added as well. Something will probably also be needed to support UDFs that reach groupby.apply via resample/window, but I'm not familiar enough with that code to reason about it.


Labels: API Design; Apply (Apply, Aggregate, Transform, Map); Refactor (Internal refactoring of code)
