Description
Edited: Fixed typo
This is a feature request. I start by showing how I usually work and then suggest a new method, NDFrame.subset,
including an implementation sketch.
I have also written a longer explanation and some examples here (personal blog) and here (jupyter notebook).
Happy to discuss, and if there is interest in this I could probably provide a PR.
Code Sample
To select row subsets of DataFrames or Series based on their values, I usually write something like
data_subset = (
    data
    .pipe(lambda d: d[d['some_column'] > 0])
    .pipe(lambda d: d[complicated_predicate(d)])
    # etc, chaining operations as necessary
)
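For anyone who wants to run the pattern above, here is a self-contained version. The data, the column names, and the body of complicated_predicate are made up purely for illustration; any function returning a boolean Series aligned with the frame's index would do.

```python
import pandas as pd

# Hypothetical toy data; the column names are invented for this example.
data = pd.DataFrame({
    "some_column": [-1, 2, 3, 0, 5],
    "other_column": [10, 20, 30, 40, 50],
})


def complicated_predicate(d):
    # Hypothetical stand-in: returns a boolean Series aligned with d's index.
    return (d["other_column"] % 20 == 10) & (d["some_column"] < 5)


data_subset = (
    data
    # keep rows with a positive value in some_column
    .pipe(lambda d: d[d["some_column"] > 0])
    # then keep rows for which the predicate is True
    .pipe(lambda d: d[complicated_predicate(d)])
)

print(data_subset)
```

Each step receives the already-filtered frame, so the second predicate is only evaluated on the rows that survived the first one.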
Problem description
This works perfectly well, but the syntax seems unnecessarily complicated given how often I do this operation. For comparison, using filter() in R's dplyr package you would write
data.subset <- data %>%
    filter(some.column > 0) %>%
    filter(complicated.predicate(.))
    # etc.
To my eyes (although I don't use R much), the R code makes the intention more visible because
- the R code has fewer brackets etc., and
- the verb filter is much more specific than pipe.
Suggestion
I would suggest adding a method subset to NDFrame. A minimal implementation could be something like this:
class NDFrame:
    ...
    def subset(d, predicate, *args, **kwargs):
        return d[predicate(d, *args, **kwargs)]
This could be used as follows:
data_subset = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(complicated_predicate)
    # etc, chaining operations as necessary
)
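The idea can be tried today without touching pandas itself. The sketch below is only an approximation of the proposal: it defines subset as a free function and routes it through the existing pipe method, so it is not the suggested NDFrame.subset method. The toy data and the above predicate are hypothetical names introduced for illustration.

```python
import pandas as pd


def subset(d, predicate, *args, **kwargs):
    """Return the rows of d for which predicate(d, *args, **kwargs) is True."""
    return d[predicate(d, *args, **kwargs)]


# Hypothetical toy data for illustration.
data = pd.DataFrame({"some_column": [-1, 2, 3, 0, 5]})


def above(d, threshold, column="some_column"):
    # Hypothetical predicate that takes extra arguments.
    return d[column] > threshold


data_subset = (
    data
    # pipe calls subset(data, <lambda>)
    .pipe(subset, lambda d: d["some_column"] > 0)
    # extra positional and keyword arguments are forwarded to the predicate
    .pipe(subset, above, 2, column="some_column")
)

# The same function works on a Series, since it only relies on boolean indexing.
series_subset = data["some_column"].pipe(subset, lambda s: s > 0)

print(data_subset)
print(series_subset)
```

Forwarding *args and **kwargs mirrors how pipe treats its callable, which lets a parameterised predicate be reused without wrapping it in a lambda.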