Selections.jl + DataFrames.jl #1936

Closed
@Drvi

Hi!

I've put together a package, Selections.jl, that implements quite powerful column selection and renaming capabilities for DataFrames.jl, and I would love to see it incorporated into DataFrames.jl.

You can select columns based on their names, positions, ranges and regular expressions, just like DataFrames does. Apart from that, you can select columns by boolean indexing and by applying predicate functions to column names, to column values, or to both. For example, you can (de)select columns that have more than 60% missing values and whose names are all caps and contain the string "ID" like this:

using DataFrames: DataFrame
using Selections, Statistics

julia> df = DataFrame(A_ID = 1:4, b_ID = repeat([missing], 4), C_ID = [missing, missing, missing, 1])
4×3 DataFrame
│ Row │ A_ID  │ b_ID    │ C_ID    │
│     │ Int64 │ Missing │ Int64⍰  │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ missing │ missing │
│ 3   │ 3     │ missing │ missing │
│ 4   │ 4     │ missing │ 1       │


julia> select(df, if_pairs((k,v) -> uppercase(k) == k && occursin("ID", k) && (mean(ismissing.(v)) > 0.6)))
4×1 DataFrame
│ Row │ C_ID    │
│     │ Int64⍰  │
├─────┼─────────┤
│ 1   │ missing │
│ 2   │ missing │
│ 3   │ missing │
│ 4   │ 1       │

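
The predicate in the example above can be checked without Selections.jl. Below is a minimal sketch in plain Julia, assuming the predicate receives each column name as a string and each column as a vector of values. The NamedTuple `cols` and the helper `pred` are illustrative names and are not part of Selections.jl or DataFrames.jl.

```julia
using Statistics

# Same columns as the DataFrame in the example, stored as a plain NamedTuple.
cols = (A_ID = 1:4,
        b_ID = fill(missing, 4),
        C_ID = [missing, missing, missing, 1])

# Hypothetical helper: true when the name is all caps, contains "ID",
# and more than 60% of the values are missing.
pred(k, v) = uppercase(k) == k && occursin("ID", k) && mean(ismissing.(v)) > 0.6

# Keep the names of the columns for which the predicate holds.
selected = [k for (k, v) in pairs(cols) if pred(String(k), v)]
```

Only `C_ID` passes all three conditions: `A_ID` has no missing values and `b_ID` is not all caps, which agrees with the output shown above.
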
  • You can also chain all kinds of conditions together using & and | to create quite complex selection rules.
  • All selection conditions can be negated, which selects the complement of the original selection.
  • You can also rename selected columns, and you can apply multiple renaming functions to multiple columns based on the selection criteria:
# here I use Selections.rename to make sure I keep all the columns in their original order
julia> rename(df, -1 => key_suffix("_B"), r"^[A-Z]" => key_prefix("ac_"))
4×3 DataFrame
│ Row │ ac_A_ID │ b_ID_B  │ ac_C_ID_B │
│     │ Int64   │ Missing │ Int64⍰    │
├─────┼─────────┼─────────┼───────────┤
│ 1   │ 1       │ missing │ missing   │
│ 2   │ 2       │ missing │ missing   │
│ 3   │ 3       │ missing │ missing   │
│ 4   │ 4       │ missing │ 1         │
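
The resulting column names can be traced by hand. The sketch below works on plain strings only and does not use Selections.jl. It assumes that `-1` means "every column except the first" and that the two renaming rules are applied in the order written; `colnames`, `suffixed` and `renamed` are illustrative names.

```julia
# Original column names, in order.
colnames = ["A_ID", "b_ID", "C_ID"]

# Rule 1 (assumed meaning of `-1 => key_suffix("_B")`):
# append "_B" to every column except the first.
suffixed = [i == 1 ? n : n * "_B" for (i, n) in enumerate(colnames)]

# Rule 2 (assumed meaning of `r"^[A-Z]" => key_prefix("ac_")`):
# prepend "ac_" to every name that starts with an uppercase letter.
renamed = [occursin(r"^[A-Z]", n) ? "ac_" * n : n for n in suffixed]
```

Under these assumptions `renamed` holds `ac_A_ID`, `b_ID_B` and `ac_C_ID_B`, the same names as in the output above, with the column order unchanged.
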

Please see the README.md for a more comprehensive description of the package.

Currently, Selections exports both the select and rename functions, which conflict with the DataFrames exports. So my question is: would you like this functionality to be a part of DataFrames? I'd be more than happy to make the necessary changes (e.g., make the API compliant with DataAPI) and iron out the API if you think there is room for improvement. In any case, I'd love to get some feedback on the package so that it can be useful for the community.

Thank you for reading this. :)
