Enhancement: Subset selection similar to R dplyr filter #26809
Comments
You are aware that you can rewrite your original function like this, right?

```python
data[(data['some_column'] > 0) & (complicated_predicate(data))]
```

There's also Numexpr for some ops. Unless I'm missing something, it's not clear the suggested syntax adds anything. |
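For the Numexpr route mentioned above, pandas exposes it through `DataFrame.query`, which compiles the expression with numexpr when it is installed (the column name here is illustrative):

```python
import pandas as pd

data = pd.DataFrame({"some_column": [-1, 2, 3]})

# query() evaluates the expression string, using the numexpr engine
# when available and falling back to pure Python otherwise
subset = data.query("some_column > 0")
print(subset["some_column"].tolist())  # [2, 3]
```

Note that `query` only covers expressions it can parse; arbitrary Python predicates still need boolean-mask indexing.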
Thank you, yes, I am aware. Maybe I should have been clearer that the suggestion is only for readability and ease of editing. I often have to perform a long list of related operations (maybe 10-15 in a row) on datasets, and in these cases I find it most readable if they are expressed as a chain of calls, with each line representing one thought.
The other way to preserve these benefits is to do something like

```python
d = data
d = d[d > 0]
d = d[complicated_predicate(d)]
d = d.unstack()
d = d.groupby('some_level').mean()
# etc, maybe 5-10 more lines of stacking, unstacking, selecting, grouping, ...
result = d  # finally assign a more meaningful name
```

But in that case I find it much more readable, and faster, to write

```python
result = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(complicated_predicate)
    .unstack()
    .groupby('some_level').mean()
    # etc
)
```

I find the latter example much more readable because it's easier to scan the code and say to myself "uh-huh, subset, subset, unstack, group, mean, ...". In the end this helps me focus on the problem domain. |
@rasmuse you can already do this in a nice chained way with `.loc` and a callable.
Here's a nice article, and see the pandas docs: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#selection-by-callable |
@jreback Wow, thank you for that! Embarrassed I had not seen it in the docs yet. I had already seen the article at towardsdatascience.com, and it very much resembles my way of working, but it did not mention this. And just for completeness, I guess you meant to write

```python
data_subset = (
    data
    .loc[lambda d: d['some_column'] > 0]
    .loc[complicated_predicate]
    # etc, chaining operations as necessary
)
```
|
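For reference, a self-contained version of the `.loc`-with-callable chain above (the column name and predicate are illustrative stand-ins):

```python
import pandas as pd

def complicated_predicate(d):
    # stand-in for any function mapping a frame to a boolean mask
    return d["some_column"] % 2 == 0

data = pd.DataFrame({"some_column": [-2, 1, 2, 4]})

# .loc accepts a callable, so each step receives the *intermediate*
# frame rather than re-referencing the original `data`
data_subset = (
    data
    .loc[lambda d: d["some_column"] > 0]
    .loc[complicated_predicate]
)
print(data_subset["some_column"].tolist())  # [2, 4]
```

Passing callables to `.loc` is exactly what makes the chain safe: each lambda sees the already-filtered frame.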
You might be interested in #26642 and related issues. Basically, we got the names of .filter and .select backwards. We may be able to rectify this in the future.
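To illustrate why the current names feel backwards: today's `DataFrame.filter` matches *labels* (column or index names), not rows by value, which is what dplyr's `filter` does (the column names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"year": [2018, 2019], "value": [3.1, 2.7]})

# pandas .filter() selects by label, here column names containing "ye";
# it never looks at the values in the rows
print(df.filter(like="ye").columns.tolist())  # ['year']
```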
…On Wed, Jun 12, 2019 at 11:28 AM Jeff Reback wrote: Closed #26809.
|
This is a feature request. I start by showing how I usually work, and then suggest a new method `NDFrame.subset`, including an implementation sketch. I have also written a longer explanation and some examples here (personal blog) and here (jupyter notebook).
Happy to discuss, and if there is interest in this I could probably provide a PR.
Code Sample
To filter row subsets of DataFrames or Series based on their values, I usually write something like
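The usual spelling is boolean-mask indexing, matching the snippet quoted in the comments (the column name and predicate are illustrative):

```python
import pandas as pd

def complicated_predicate(d):
    # illustrative predicate: frame -> boolean mask
    return d["some_column"] < 10

data = pd.DataFrame({"some_column": [-5, 3, 12]})

# combine conditions with & and mandatory parentheses
filtered = data[(data["some_column"] > 0) & (complicated_predicate(data))]
print(filtered["some_column"].tolist())  # [3]
```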
Problem description
This works perfectly well, but the syntax seems unnecessarily complicated given how often I do this operation. For comparison, using `filter()` in R's `dplyr` package you would write the corresponding filter call directly. To my eyes (although I don't use R much), the R code makes the intention more visible, because `filter` is much more specific than `pipe`.
.Suggestion
I would suggest adding a method
subset
toNDFrame
. A minimal implementation could be something like this:This could be used as follows:
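One minimal sketch, assuming `subset` simply delegates to boolean indexing (hypothetical; this is not part of pandas, and the monkey-patch is purely for demonstration):

```python
import pandas as pd

def subset(self, predicate):
    """Return the rows for which predicate(self) is True.

    predicate may be a callable returning a boolean mask, or a mask
    itself. Hypothetical sketch; not part of pandas.
    """
    mask = predicate(self) if callable(predicate) else predicate
    return self[mask]

# attach to DataFrame purely for demonstration
pd.DataFrame.subset = subset

data = pd.DataFrame({"some_column": [-1, 2, 3]})
result = (
    data
    .subset(lambda d: d["some_column"] > 0)
    .subset(lambda d: d["some_column"] > 2)
)
print(result["some_column"].tolist())  # [3]
```

Functionally this overlaps with `.loc[callable]`, as pointed out in the comments; the argument for it is purely the more descriptive verb.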