-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Method for selecting columns from DataFrameGroupBy (and maybe DataFrame) #40322
Comments
I think I like the idea of ripping off dply'rs So overall I think +1 to adding |
Just to clarify, dplyr's |
By "targeted to selecting rows" I obviously meant "targeted to selecting columns" :) Fixed. |
why is
just not the answer here? why is 'let's just add api' the answer? I am -1 on adding almost anything at this point to an already very complicate interface. |
Because either pandas supports method chaining as an interface or it doesn't. If it does, then it should be possible to define a sequence of operations using a single syntactical construct. Needing to switch back and forth has a cost when learning pandas, when writing code and, and when reading code. Now obviously there will be some edge cases that aren't possible through the pandas API, which is where But this usecase (group by then select) is likely one of the most common operations in pandas. I get that there is a maintenance cost to increasing the API surface, but users accrue real benefits from being able to do things in a consistent, predictable way. |
again what is wrong with the first construct itself. This does support method chaining. What exactly do you want to do that is not possible? |
It supports method chaining, then, oops, wait, gotta use the indexing operator, then ok some more method chaining. That's not exactly the same thing. I think it would be a mistake to blanket reject API additions that enhance consistency just because they don't add sufficient novel functionality. (That said, some people feel excited about using a callable to select, which isn't possible in the current API). |
Hi, the current pandas workflow incites to "extract" the bare minimum data, then apply transformations. df.query("sex == 'male'").groupby("species").apply(
lambda x: x.filter(regex="_mm").mean()
) That being said, this ⬆️ got the job done so far so I totally understand the point on the maintenance burden. |
One thing I really like about Python is that usually, when you find yourself doing something that feels really awkward, it's a clue that there's a better way to do it. That intuition has trapped me here many times. I write a pipeline that mixes method and index chaining, think "no way, that's too ugly, there must be a better syntax for this standard task" and then sink time trying to figure out what the right method is called. |
so you are saying that python itself should not have but we have these operators. I am not sure what you are actually suggesting, e.g. and please don't put words in my statement
I don't reject things off hand, instead they must have a really high bar. It doesn't need to be novel, just reduce complexity. The proposal may reduce complexity a small amount, at the enormous expense of adding non-standard usages. |
When someone takes a while to write an informative feature request, with an explanation, and then just gets a "but why add this" response that doesn't engage at all with what was written, it suggests there is a kneejerk policy of rejection. (Which is fine too, you're the project lead, but it doesn't really incentivize thoughtful feature requests). I don't think everything should be a method call. But I do think that when everything but one step in a chain of operations is a method call, it's ugly and confusing. If pandas default were to operate in place, with an option for returning a copy, it would be reasonable to say "well we don't really support method-chaining as a first-class interface". But your design decisions have pushed users in that direction. And the success of pandas — and dplyr pipelines in R — has proven that it's a really useful idiom for data transforms. So it's' painful when there's one (really common) step where it breaks down. |
I disagree. I appreciate its subjective but I see no issue with the current syntax, in fact I prefer the simplicity of the operators to another method, don't find it ugly nor painful. I think it would be more confusing to have and maintain multiple syntax achieving the same result (there are other areas I would like to see focus/consistency on selecting data). However, its is a useful comment to raise discussion, its just not for me. |
Why are operators "simpler" than a method call, in the context of a sequence of method calls? Bear in mind that pandas overloads the behavior of |
Note that
And a downside of the operators is that they don't have a docstring you can quickly check to see how they work. You need to dig up a description of the behavior from ... somewhere in the docs. An advantage of adding a |
for being the author of a nice package this is a really disrespectful response. This is a very valid response. Pandas api is WAY too big. I think we can all agree on that. The bar for adding anything is REALLY high. The initial response to virtually anything is NO. YOU have to have a really convincing argument to persuade. Feel free to get support on this issue. |
I don't mean disrespect, but I spent my whole morning writing up an issue that had clear examples, considered various alternatives, acknowledged the cost of an API expansion, and gave reasons for why I think the existing syntax is a problem. To get a response that demanded justification for the proposal without engaging on anything I wrote in the initial post was frustrating. This isn't the first time I've had this experience, and when I have ideas for features in pandas I often just don't even bother opening an issue: not because the bar to acceptance is high, but because the effort of writing a clear proposal is not respected. But we're getting distracted. I guess i'll copy what I see as the key point
This is a cognitive explanation for the (yes, subjective) reaction to the "ugly" code. It's ugly because it requires you to do too much work to understand it. IMO it is totally valid to consider consistency as a justification for increasing API. I reacted negatively to your question of "What exactly do you want to do that is not possible?" because I think I was pretty clear that the problem is not what can be done but how it is to be done. |
my response had almost nothing to do with the what you wrote / or the time you spent. It actually was reading it from a glance. I couldn't tell why the first example was not correct. Your narrative may have looked clear to you since you spent time on it. But my goal is to read these w/o a lot of context and try to determine what exactly is the problem and what the proposed solution is. Then I ask clarifying questions. This interaction clearly failed. If someone spends a lot of time writing something up, but its not clear at a glance what its about., then it needs revision. Most constributors / reviewers do not have the time to real a long extended treatise on things (not that your is, but just saying). Some like to write really long things. To be honest this is a burden on all, you can take your time to write whatever you want and expect that everyone else has the time to read it. Again I am not saying that is the case here. So my point is to be even more clear and edit / clarify the writeup. I asked several times 'why' is this not fluent and all I got defensive arguments |
Thanks for clarifying your approach. I'll consider this resolved. |
Note: this can still be achieved: (
df["x"]
.groupby([df["a"], df["b"]])
.mean()
) |
@mwaskom thanks a lot for opening the issue and providing a nice, structured proposal and argumentation. Let's not close this issue because the interaction with Jeff didn't go that well, but give time for others to join the discussion. Personally, I am +1 on adding a |
Alright, if this remains under discussion, I'll try to summarize the points I've made.
|
Hi @mwaskom Do you have an example (maybe wildy ignoring Pandas / Python rules) to illustrate what you want? I think I understand the post and it is well written. I just don't see the counterfactual. Also, I really agree with this:
There are many workflows which feel extremely weird to execute in Pandas, even to a vivid degree. This is sort of unrelated but one example is the use case of creating a unique "ID" per group. Until 2018/2019 this seems to involve undocumented functionality until being addressed by Again, off-topic but perhaps this is due to some "path dependency"/ "founder effect" of its original financial time series origin, which may then produced counter intuitive effects. |
Using the example of the top post (that shows different ways how it is possible right now to achieve this), with a
|
Thanks @jorisvandenbossche, This is helpful to me, someone who had to look up "fluent" programming. I think this example makes a lot of sense now and I think I agree with the OP's point (barring the issues of dependencies, coordination, mental energy, open source cat herding, burn out, etc. that may or may not be touched on by other discussion). |
Thanks @jorisvandenbossche; edited the initial post to add that and make the proposal more explicit. |
@mwaskom thanks for taking the time. A couple of questions to make sure i understand correctly
|
No we haven't really discussed whether it should be a
Yes, others are excited about using a method (v.s. index operator) to allow more elaborate selection rules (possibly overlapping with |
One might also consider how black treats method chaining. Running with black version 21.5b2 on
results in
In my opinion this is an undesirable result - the subsetting of columns should be on its own line, otherwise it is easy to miss when glancing at the code. However perhaps this is an upstream (black) issue. |
I may add my opinion on this. I agree with OP's point that introducing I think API consistency is important for users that do not use I personally think there are so many variation for the same method for different class(ex. And I personally think so many methods for the same function is not a downside, Arguments above inevitably makes Anyhow, I myself implemented the method import pandas as pd
def select(self, *args):
#print(args, len(args))
if len(args)==1:
if isinstance(args[0], list):
args = args[0]
else:
args = [args[0]]
else:
args = [x for x in args]
#print(args)
return self.__getitem__(args)
pd.DataFrame.select = select
pd.core.groupby.generic.DataFrameGroupBy.select = select it's just a wrapper for
|
Is your feature request related to a problem?
When using a "fluent"/"method chaining" style of programming with pandas, a common hiccup is the lack of a method for selecting columns from a
DataFrameGroupBy
object. For example, take this DataFrameTo group by one column and then return the mean of another column within each group, one has to use multiple syntactical constructs:
There is also dot-attribute access columns:
but that is not fully general: it won't work if your column name collides with an existing method, and it won't work if the column is defined by a variable.
Another option is to select the column first then groupby using a series, not a name:
This is not so bad. Well, ideally each step in the pipeline would be on a new line, and it requires some duplicated typing. But the bigger issue is that it fails when you want to group by more than one column:
Traceback
(I actually feel like this really ought to work, independently of whether it's the best solution to this Feature Request, but let's leave it aside for now).
So in summary: There's no fully-general, non-awkward method for selecting a column from
DataFrameGroupBy
. While certainly not fatal, it does make fluent pandas harder to read[1] and write[2]. And this is not an esoteric application: groupby/select/apply must be one of the most common pandas workflows.[1] The great advantage of this method chaining approach for understanding code is that you can read off the sequence of verbs that comprise the operation. Introducing the
[column]
syntax requires your brain to shift from comprehension mode to production mode — you need to generate a verb for what's happening — which is a costly operation that likely (on the margin) impairs comprehension.[2] I really like fluent pandas but it currently can be a little bit annoying to fight with autoindent/autoformat tools when writing it. It likely would be easier to improve that tooling if one could assume each pipeline step begins with a dotted method access.
Describe the solution you'd like
The
DataFrameGroupBy
object could have a column selection method.Possible names:
DataFrameGroupBy.get
This exists in
DataFrame
as a thin wrapper aroundself[arg]
. Possibly adds confusion withDataFrameGroupBy.get_group
DataFrameGroupBy.select
This is the method I have reached for more than once. I'm broadly aware of the history of
NDFrame.select
; this name should now be available, though it could cause some confusion to reintroduce it with different semantics (but xref #26642).It probably wouldn't make sense to add
.select
only toDataFrameGroupBy
and not toDataFrame
/NDFrame
. So this would be more work (but also more benefit?)In code, that might look like
API breaking implications
No breakage, unless someone has very old code using
DataFrame.select
that completely missed the original deprecation cycle and gets resurrected now.Adding new methods (especially new ways of indexing pandas objects) definitely has an API complexity cost. But I would argue that, by converging on "one way to do it" in method chains, it could provide a a net reduction in complexity from the user perspective.
Additional context
This originally got some discussion on twitter, note especially @TomAugspurger's response.
Other relevant context would be the select verb in
dplyr
.The text was updated successfully, but these errors were encountered: