Description
From #34998 (comment)
..., this got me a bit thinking about the general issue we are tackling here: should `apply()` be smart and try to infer what kind of function is being applied? Or should it not try to be smart and always do the same (eg always prepend groups by default).
For the specific case where the applied function is a transformation (preserving identical index), the consequence of the changes we decided to do in this PR, is that we want `apply` to no longer be smart.
But currently (and this is long-time behaviour), `apply` tries to be smart and this is also documented. In the "flexible apply" section in the docs, there is a note that says:
`apply` can act as a reducer, transformer, or filter function, depending on exactly what is passed to it. So depending on the path taken, and exactly what you are grouping. Thus the grouped column(s) may be included in the output as well as set the indices.
(So if we go forward as is with this PR, that note should actually be updated.)
The consequence of this PR is that `apply` will no longer try to detect if it acted as a transformer or not (at least by default with `group_keys=True`, as `group_keys=False` still gives this reindexing behaviour for a transformer as discussed above).
But if we no longer detect the transformer case, should we then also stop detecting the reducer case?
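To make the "smart" behaviour concrete, here is a small sketch with made-up data (the frame and column names are illustrative only). It uses `group_keys=False` on a single selected column, since that is the setting the comment above describes as keeping the reindexing behaviour for a transformer; the output with `group_keys=True` depends on the pandas version and is deliberately not shown.

```python
import pandas as pd

# Illustrative data, not taken from the PR.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1, 2, 3, 4]})
grouped = df.groupby("key", group_keys=False)["val"]

# Reducer: one scalar per group, so the result is indexed by the group labels.
reduced = grouped.apply(lambda s: s.sum())

# Transformer: each per-group result keeps its group's index, so the
# combined result lines up with the index of the original frame.
transformed = grouped.apply(lambda s: s - s.mean())

print(reduced)
print(transformed)
```

The same method call produces two differently shaped results purely because of what the UDF returned, which is the inference the comment is questioning.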
IMO, `groupby.apply` should not be a one-stop-shop for all your UDF needs. If users want to use a reducer, use `agg`; if they want to use a transformer, use `transform`; if they want to use a filter, use `filter`. The role of `apply`, to me, should be for UDFs that don't fit into these roles. In this sense, I think `apply` should not go out of its way to generate more convenient/sensible output for these classes of UDFs.
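For reference, a minimal sketch of the three dedicated methods on made-up data (the frame and the UDFs are illustrative assumptions). Each method has a fixed output shape, so nothing has to be inferred from the UDF's return value:

```python
import pandas as pd

# Illustrative data: group "a" has two rows, group "b" has three.
df = pd.DataFrame({"key": ["a", "a", "b", "b", "b"], "val": [1, 2, 3, 4, 5]})
g = df.groupby("key")["val"]

# Reducer: one value per group, indexed by the group labels.
total = g.agg("sum")

# Transformer: same length and index as the input.
centered = g.transform(lambda s: s - s.mean())

# Filter: keeps or drops whole groups; here only groups with more than 2 rows.
kept = g.filter(lambda s: len(s) > 2)

print(total)
print(centered)
print(kept)
```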
The result of a groupby is a mapping from each group to a Python object. Whatever behavior is decided for `apply`, it should be well-defined for every case, with no more assumptions than that. Here is a naive attempt at doing so.
- If all constituent results are Series/DataFrames, concatenate results together.
- Otherwise, the result is a Series where each value is what was returned from the UDF; dtype is inferred.
- Keyword-only arg on `apply` (not `groupby`) to control if the groups are added to the index, columns, or neither.
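The three rules above could be sketched as a standalone function. This is only an illustration of the proposal, not how pandas implements `apply`: the name `naive_apply`, the `keys` and `key_name` parameters, and the reading of "columns" as "insert the group label as a column" are all assumptions made for this example.

```python
import pandas as pd


def naive_apply(grouped, func, *, keys="index", key_name="group"):
    """Hypothetical sketch of the proposed rules (not a pandas API).

    keys: "index", "columns", or None. Controls where group labels go
    when the per-group results are concatenated.
    """
    if keys not in ("index", "columns", None):
        raise ValueError("keys must be 'index', 'columns', or None")

    # Step 0: the raw mapping from group label to whatever the UDF returned.
    labels, results = [], []
    for label, group in grouped:
        labels.append(label)
        results.append(func(group))

    # Rule 1: if every result is a Series/DataFrame, concatenate them.
    if results and all(isinstance(r, (pd.Series, pd.DataFrame)) for r in results):
        if keys is None:
            return pd.concat(results)
        stacked = pd.concat(results, keys=labels, names=[key_name])
        if keys == "index":
            return stacked
        # Assumed meaning of "columns": move the group level into a column.
        return stacked.reset_index(level=0)

    # Rule 2: otherwise, one value per group; dtype is inferred by Series.
    return pd.Series(results, index=labels)


df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 3]})
g = df.groupby("key")["val"]

scalars = naive_apply(g, lambda s: s.max())              # rule 2
stacked = naive_apply(g, lambda s: s * 2)                # rule 1, keys in index
flat = naive_apply(g, lambda s: s * 2, keys=None)        # rule 1, no keys
as_col = naive_apply(g, lambda s: s * 2, keys="columns") # rule 1, keys as column
```

Note that the function never inspects whether the UDF behaved like a reducer or a transformer; the only branch is on the type of the returned objects.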
Perhaps numpy-specific support should be added in. Probably things need to be added to support UDFs that reach `groupby.apply` via resample/window, but I'm not familiar enough with that code to reason about it.