Enhancement: Subset selection similar to R dplyr filter #26809

Closed
@rasmuse

This is a feature request. I start by showing how I usually work and then suggest a new method, NDFrame.subset, including an implementation sketch.

I have also written a longer explanation and some examples here (personal blog) and here (jupyter notebook).

Happy to discuss, and if there is interest in this I could probably provide a PR.

Code Sample

To select row subsets of a DataFrame or Series based on their values, I usually write something like:

data_subset = (
    data
    .pipe(lambda d: d[d['some_column'] > 0])
    .pipe(lambda d: d[complicated_predicate(d)])
    # etc, chaining operations as necessary
)
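For reference, here is a runnable version of the pattern above. The sample data and the body of complicated_predicate are hypothetical stand-ins chosen only to make the snippet self-contained.

```python
import pandas as pd

# Hypothetical sample data; only the column name some_column comes from the snippet above.
data = pd.DataFrame({
    "some_column": [-1, 2, 3, 0, 5],
    "other": [10, 20, 30, 40, 50],
})


def complicated_predicate(d):
    # Hypothetical stand-in: keep rows whose "other" value exceeds
    # the mean of the frame it receives (the already filtered frame).
    return d["other"] > d["other"].mean()


data_subset = (
    data
    .pipe(lambda d: d[d["some_column"] > 0])
    .pipe(lambda d: d[complicated_predicate(d)])
)
print(data_subset)
```

Note that each predicate is evaluated on the output of the previous step, which is why the lambda wrapper (rather than a mask computed up front on data) is needed.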

Problem description

This works perfectly well, but the syntax seems unnecessarily complicated given how often I do this operation. For comparison, using filter() in R's dplyr package you would write:

data.subset <- data %>%
    filter(some.column > 0) %>%
    filter(complicated.predicate(.))
    # etc.

To my eyes (although I don't use R much), the R code makes the intention more visible because

  1. the R code has fewer brackets etc., and
  2. the verb filter is much more specific than pipe.

Suggestion

I would suggest adding a method subset to NDFrame. A minimal implementation could be something like this:

class NDFrame:

    ...

    def subset(d, predicate, *args, **kwargs):
        return d[predicate(d, *args, **kwargs)]

This could be used as follows:

data_subset = (
    data
    .subset(lambda d: d['some_column'] > 0)
    .subset(complicated_predicate)
    # etc, chaining operations as necessary
)
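Until such a method exists, the same behaviour can be tried out without touching pandas itself. The following is only a sketch: it assumes subset is written as a free function and chained through the existing pipe method, and the sample data and the other_above helper are hypothetical.

```python
import pandas as pd


def subset(d, predicate, *args, **kwargs):
    """Return the rows of d for which predicate(d, *args, **kwargs) is True.

    Works for both DataFrame and Series, since both support boolean indexing.
    """
    return d[predicate(d, *args, **kwargs)]


# Hypothetical sample data.
data = pd.DataFrame({
    "some_column": [-1, 2, 3, 0, 5],
    "other": [10, 20, 30, 40, 50],
})


def other_above(d, threshold):
    # Hypothetical predicate taking an extra argument,
    # to show how *args and **kwargs are forwarded.
    return d["other"] > threshold


data_subset = (
    data
    .pipe(subset, lambda d: d["some_column"] > 0)
    .pipe(subset, other_above, threshold=25)
)

# The same function applied directly to a Series.
series_subset = subset(data["other"], lambda s: s > 25)
```

With the proposed method, the two pipe(subset, ...) calls would shorten to .subset(...), which is the readability gain this request is about.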
