API: Should apply be smart?

From https://github.com/pandas-dev/pandas/pull/34998#issuecomment-759416692

> ..., this got me a bit thinking about the general issue we are tackling here: **should `apply()` be smart and try to infer what kind of function is being applied?** Or should it not try to be smart and always do the same (eg always prepend groups by default). 
>
> For the specific case where the applied function is a transformation (preserving identical index), the consequence of the changes we decided to do in this PR, is that we want `apply` to no longer be smart. 
>
> But currently (and this is long-time behaviour), `apply` tries to be smart and this is also documented. In the "flexible apply" section in the docs, there is a note that says:
>
> > ``apply`` can act as a reducer, transformer, *or* filter function, depending on exactly what is passed to it.
> > So depending on the path taken, and exactly what you are grouping. Thus the grouped column(s) may be included in
> > the output as well as set the indices.
>
> (So if we go forward as is with this PR, that note should actually be updated.) 
>
> The consequence of this PR is that `apply` will no longer try to detect if it acted as a *transformer* or not (at least by default with `group_keys=True`, as `group_keys=False` still gives this reindexing behaviour for a transformer as discussed above). 
>
> But if we no longer detect the transformer case, should we then also stop detecting the *reducer* case? 

IMO, `groupby.apply` should not be a one-stop shop for all your UDF needs. Users who want a reducer should use `agg`; those who want a transformer should use `transform`; those who want a filter should use `filter`. To me, the role of `apply` is to handle UDFs that don't fit into these roles. In this sense, I think `apply` should not go out of its way to generate more convenient/sensible output for these classes of UDFs.
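To make the three roles concrete, here is a small sketch using the dedicated methods on a made-up frame (the column names `key` and `val` and the lambdas are illustrative only, not taken from the linked PR):

```python
import pandas as pd

# Toy data, purely for illustration.
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "val": [1, 2, 3, 4, 5]})
grouped = df.groupby("key")["val"]

# Reducer: one value per group, indexed by the group keys.
reduced = grouped.agg(lambda s: s.max() - s.min())

# Transformer: output keeps the same index as the input.
transformed = grouped.transform(lambda s: s - s.mean())

# Filter: whole groups are kept or dropped; surviving rows keep their index.
filtered = df.groupby("key").filter(lambda g: len(g) > 2)

print(reduced)
print(transformed)
print(filtered)
```

In each case the shape of the output follows from the method that was called, so nothing has to be inferred from what the UDF happened to return.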

The result of a groupby is a mapping from each group to a Python object. Whatever behavior is decided for `apply`, it should be well-defined for every such mapping, with no assumptions beyond that. Here is a naive attempt at doing so.

 - If all constituent results are Series/DataFrames, concatenate results together.
 - Otherwise, the result is a Series where each value is what was returned from the UDF; dtype is inferred.
 - A keyword-only argument on `apply` (not `groupby`) controls whether the groups are added to the index, the columns, or neither.
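The rules above could be sketched roughly as follows. This is only an illustration of the proposal, not how pandas implements `apply`: the function `naive_apply` and its keyword `keys_to` are hypothetical names, and the "columns" option is left out because the list above does not pin down its semantics.

```python
import pandas as pd


def naive_apply(df, by, func, *, keys_to="index"):
    """Hypothetical sketch of the proposed combine rule (not pandas' code).

    keys_to="index" prepends the group keys to the index of a concatenated
    result; keys_to=None leaves them out.
    """
    if keys_to not in ("index", None):
        raise ValueError("keys_to must be 'index' or None in this sketch")

    # The raw result: a mapping from each group key to whatever func returned.
    results = {key: func(group) for key, group in df.groupby(by)}

    if results and all(
        isinstance(r, (pd.Series, pd.DataFrame)) for r in results.values()
    ):
        # Rule 1: all results are Series/DataFrames, so concatenate them.
        if keys_to == "index":
            return pd.concat(results, names=[by])
        return pd.concat(list(results.values()))

    # Rule 2: otherwise, one value per group; dtype is inferred by pandas.
    index = pd.Index(list(results.keys()), name=by)
    return pd.Series(list(results.values()), index=index)


df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "val": [1, 2, 3, 4, 5]})

r1 = naive_apply(df, "key", lambda g: g["val"].sum())          # scalar per group
r2 = naive_apply(df, "key", lambda g: g["val"] * 2)            # Series per group
r3 = naive_apply(df, "key", lambda g: g["val"] * 2, keys_to=None)
```

Note that the sketch never inspects whether a returned Series has the same index as its group, so a transformer-like UDF (`r2`) is treated like any other Series-returning UDF.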

Perhaps NumPy-specific support should be added. UDFs that reach `groupby.apply` via resample/window probably need additional support too, but I'm not familiar enough with that code to reason about it.

API: Should apply be smart? #39209
