Description
Add a new method DataFrame.select to select columns from a DataFrame. The exact specs are still open to discussion; here I write a draft of what the method could look like.
Basic case: select columns. Personally, I think both forms should be supported for convenience, as a list or as multiple parameters with *args:
df.select("column1", "column2")
df.select(["column1", "column2"])
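To make the two call styles concrete, here is a minimal sketch written as a standalone function. The name select and its signature are only assumptions for illustration, not an existing pandas API, and the real method could normalize its arguments differently.

```python
import pandas as pd


def select(df: pd.DataFrame, *columns) -> pd.DataFrame:
    """Hypothetical stand-in for the proposed DataFrame.select."""
    # A single list argument is treated as the list of column names;
    # otherwise every positional argument is one column name.
    if len(columns) == 1 and isinstance(columns[0], list):
        columns = columns[0]
    return df[list(columns)]


df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4], "column3": [5, 6]})

by_args = select(df, "column1", "column2")    # multiple parameters
by_list = select(df, ["column1", "column2"])  # a single list
```

Both calls are meant to return the same two-column DataFrame.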
Cases to consider.
What if a provided column doesn't exist? I assume we want to raise a ValueError.
What if a column is duplicated? I assume we want to return the column twice.
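A rough sketch of those two behaviors, assuming a hypothetical helper select_strict (the name and the error message are made up for this example). Note that plain df[[...]] indexing raises a KeyError for missing labels, so the ValueError here is an explicit choice of the sketch.

```python
import pandas as pd


def select_strict(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical helper: ValueError on missing columns, duplicates kept."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise ValueError(f"Columns not found: {missing}")
    # Indexing with a list that repeats a label returns that column repeatedly.
    return df[list(columns)]


df = pd.DataFrame({"column1": [1, 2], "column2": [3, 4]})

# The same column requested twice comes back twice.
duplicated = select_strict(df, ["column1", "column1"])

# A missing column raises ValueError instead of KeyError.
try:
    select_strict(df, ["column1", "missing"])
except ValueError as exc:
    error_message = str(exc)
```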
How to select with a wildcard or regex? Some options:
1. Not support them (users can do anything fancy with df.columns themselves).
2. Assume the column is a regex if the name starts with ^ and ends with $. For wildcards, I guess it could be OK, if column* is provided, to first check whether a column with the star in its name exists; if it does, return it, otherwise assume the star is a wildcard.
3. Accept callables, so users can do df.select(lambda col: col.startswith("column")).
4. Have extra parameters such as regex, like df.select(regex="column\d").
5. Same as 2, but make users enable it explicitly with a flag: df.select("column\d", regex=True).
Personally, I'd start with option 1, not supporting anything fancy, and decide later. It's much easier to add something than to remove something we don't like once released.
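For reference, this is roughly what "doing it themselves" with df.columns could look like if nothing fancy is supported. The example column names are made up, and using re.fullmatch (rather than a partial match) is just one possible choice.

```python
import re

import pandas as pd

df = pd.DataFrame({"column1": [1], "column2": [2], "other": [3]})

# Predicate-based selection, similar in spirit to passing a callable.
starts_with = [col for col in df.columns if col.startswith("column")]

# Regex-based selection: keep only names that fully match the pattern.
pattern = re.compile(r"column\d")
matching = [col for col in df.columns if pattern.fullmatch(col)]

# Either list can then be used for plain column selection.
result = df[matching]
```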
What to do with MultiIndex? I guess if a list of strings is provided, they should select from the first level of the MultiIndex. Should we support the elements being tuples to select multiple levels at once? I haven't worked much with MultiIndex myself for a while, @Dr-Irv maybe you have an idea on what the expectation should be.
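As a starting point for that discussion, here is a small sketch of what plain [] indexing does with MultiIndex columns today, as far as I understand it. The column labels are invented for the example, and whether select should mirror this behavior is exactly the open question.

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
)
df = pd.DataFrame([[1, 2, 3, 4]], columns=columns)

# A list of strings selects by the first level and keeps the MultiIndex.
first_level = df[["a"]]

# A list of tuples selects specific columns across all levels at once.
by_tuple = df[[("a", "x"), ("b", "y")]]
```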
Can anyone think of anything else not trivial for implementing this?