ENH: Implement DataFrame.select to select columns #61522

Open
@datapythonista

Description

Add a new method DataFrame.select to select columns from a DataFrame. The exact specs are still open to discussion; below is a draft of what the method could look like.

Basic case: select columns. Personally, I think both forms should be supported for convenience, either a list or multiple arguments via *args:

df.select("column1", "column2")
df.select(["column1", "column2"])

Cases to consider.

What if a provided column doesn't exist? I assume we want to raise a ValueError.

What if a column is duplicated? I assume we want to return the column twice.
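A minimal sketch of the semantics described so far, written as a free function rather than a method. The function name `select` and its error message are hypothetical, and it only assumes flat (non-MultiIndex) columns:

```python
import pandas as pd


def select(df: pd.DataFrame, *columns) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed DataFrame.select."""
    # Accept either select(df, "a", "b") or select(df, ["a", "b"]).
    if len(columns) == 1 and isinstance(columns[0], list):
        columns = tuple(columns[0])

    # Raise ValueError for unknown columns, as the draft assumes.
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")

    # List-based indexing keeps the requested order and repeats duplicates.
    return df[list(columns)]


df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "column3": [5, 6]})

print(select(df, "column1", "column2").columns.tolist())
print(select(df, ["column1", "column2"]).columns.tolist())
print(select(df, "column1", "column1").columns.tolist())
```

Note that plain `df[["missing"]]` raises a KeyError today, so raising ValueError here would be a deliberate choice rather than existing behavior.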

How to select with a wildcard or regex? Some options:

  1. Not support them (users can do anything fancy with df.columns themselves).
  2. Assume the name is a regex if it starts with ^ and ends with $. For wildcards, if column* is provided, it could be OK to first check whether a column literally named column* exists; if it does, return it, otherwise treat the star as a wildcard.
  3. Accept callables, so users can do df.select(lambda col: col.startswith("column"))
  4. Add an extra parameter such as regex, e.g. df.select(regex="column\d")
  5. Same as 2, but make users enable it explicitly with a flag, e.g. df.select("column\d", regex=True)

Personally, I'd start with option 1, not supporting anything fancy, and decide later. It's far easier to add something than to remove something we don't like once it's released.
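For reference, option 1 would leave users with what pandas already offers for pattern-based selection. A short sketch using existing APIs only (the sample column names are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"column1": [1], "column2": [2], "other": [3]})

# Regex via the existing DataFrame.filter, which matches against column labels.
by_regex = df.filter(regex=r"^column\d$")

# Predicate over df.columns, similar in spirit to the callable idea in option 3.
by_predicate = df[[col for col in df.columns if col.startswith("column")]]

# Boolean mask built from the string accessor on the columns Index.
by_mask = df.loc[:, df.columns.str.startswith("column")]

print(by_regex.columns.tolist())
print(by_predicate.columns.tolist())
print(by_mask.columns.tolist())
```

All three forms select the same two columns here, which suggests a first version of select without pattern support would not block these use cases.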

What to do with MultiIndex? I guess if a list of strings is provided, they should select from the first level of the MultiIndex. Should we support tuples as elements, to select across multiple levels at once? I haven't worked much with MultiIndex for a while. @Dr-Irv, maybe you have an idea of what the expectation should be.
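As a point of comparison, this sketch shows how plain `df[...]` indexing treats MultiIndex columns today, as far as I can tell. It is not a proposal for how select should behave, and the example labels are made up:

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y"), ("b", "x")])
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns)

# A list of plain strings is matched against the first level only.
first_level = df[["a"]]

# A list of tuples addresses full column keys across all levels.
by_tuples = df[[("a", "x"), ("b", "x")]]

print(first_level.columns.tolist())
print(by_tuples.columns.tolist())
```

Mirroring this behavior would keep select consistent with `__getitem__`, though whether that is the right expectation is exactly the open question.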

Can anyone think of anything else non-trivial to consider when implementing this?

Metadata

Assignees

No one assigned

    Labels

    Enhancement
    Indexing (Related to indexing on series/frames, not to indexes themselves)
    Needs Discussion (Requires discussion from core team before further action)
