Hi!
I've put together a package, Selections.jl, that implements quite powerful column selection and renaming capabilities for DataFrames.jl, and I would love to see it incorporated into DataFrames.jl.
You can select columns based on their names, positions, ranges and regular expressions, just like DataFrames does. Apart from that, you can select columns by boolean indexing and by applying predicate functions to column names, values, or both. For example, you can (de)select columns that have more than 60% missing values and whose names are all caps and contain the string "ID" like this:
```julia
using DataFrames: DataFrame
using Selections, Statistics

julia> df = DataFrame(A_ID = 1:4, b_ID = repeat([missing], 4), C_ID = [missing, missing, missing, 1])
4×3 DataFrame
│ Row │ A_ID  │ b_ID    │ C_ID    │
│     │ Int64 │ Missing │ Int64⍰  │
├─────┼───────┼─────────┼─────────┤
│ 1   │ 1     │ missing │ missing │
│ 2   │ 2     │ missing │ missing │
│ 3   │ 3     │ missing │ missing │
│ 4   │ 4     │ missing │ 1       │

julia> select(df, if_pairs((k,v) -> uppercase(k) == k && occursin("ID", k) && (mean(ismissing.(v)) > 0.6)))
4×1 DataFrame
│ Row │ C_ID    │
│     │ Int64⍰  │
├─────┼─────────┤
│ 1   │ missing │
│ 2   │ missing │
│ 3   │ missing │
│ 4   │ 1       │
```
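For comparison, here is a rough sketch of the same predicate-based selection written with plain DataFrames indexing only. It is not how Selections.jl implements `if_pairs`; it just spells out the logic. It assumes a recent DataFrames.jl in which `names(df)` returns strings, and `pred` and `keep` are names made up for this illustration.

```julia
using DataFrames, Statistics

df = DataFrame(A_ID = 1:4, b_ID = repeat([missing], 4), C_ID = [missing, missing, missing, 1])

# Hypothetical helper: a predicate over (column name, column values).
# True when the name is all caps, contains "ID", and more than 60% of the values are missing.
pred(k, v) = uppercase(k) == k && occursin("ID", k) && mean(ismissing.(v)) > 0.6

# One Bool per column, in column order (assumes names(df) yields strings).
keep = [pred(n, df[!, n]) for n in names(df)]

selected   = df[:, keep]    # keeps only C_ID
deselected = df[:, .!keep]  # the complement: A_ID and b_ID
```

The boolean mask makes negation trivial (`.!keep`), which is roughly what a negated selection would have to compute.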
- You can also chain all kinds of conditions together using `&` and `|` in order to create quite complex selection rules.
- All selection conditions can be negated, which will select the complement of the original selection.
- You can also rename selected columns, and you can apply multiple renaming functions to multiple columns based on the selection criteria:
```julia
# here I use Selections.rename to make sure I keep all the columns in their original order
julia> rename(df, -1 => key_suffix("_B"), r"^[A-Z]" => key_prefix("ac_"))
4×3 DataFrame
│ Row │ ac_A_ID │ b_ID_B  │ ac_C_ID_B │
│     │ Int64   │ Missing │ Int64⍰    │
├─────┼─────────┼─────────┼───────────┤
│ 1   │ 1       │ missing │ missing   │
│ 2   │ 2       │ missing │ missing   │
│ 3   │ 3       │ missing │ missing   │
│ 4   │ 4       │ missing │ 1         │
```
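To make the renaming rules above concrete, here is a minimal sketch that reproduces the same column names with DataFrames alone. It is only an illustration, not the Selections.jl implementation: it reads `-1` as "every column except the first" (which is what the output above suggests), applies the suffix before the prefix, and again assumes a recent DataFrames.jl where `names(df)` returns strings. `newnames` is a made-up variable name.

```julia
using DataFrames

df = DataFrame(A_ID = 1:4, b_ID = repeat([missing], 4), C_ID = [missing, missing, missing, 1])

# Build the new names column by column, keeping the original order.
newnames = map(enumerate(names(df))) do (i, n)
    if i != 1                    # assumed meaning of `-1`: all columns but the first
        n = n * "_B"
    end
    if occursin(r"^[A-Z]", n)    # names starting with an uppercase letter get the prefix
        n = "ac_" * n
    end
    n
end

renamed = rename(df, newnames)   # columns: ac_A_ID, b_ID_B, ac_C_ID_B
```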
Please see the README.md for a more comprehensive description of the package.
Currently Selections exports both `select` and `rename`, which conflict with the DataFrames exports. So my question is -- would you like this functionality to be a part of DataFrames? I'd be more than happy to make the necessary changes (e.g., make the API compliant with DataAPI) and iron out the API if you think there is room for improvement. In any case, I'd love to get some feedback on the package so that it can be useful for the community.
Thank you for reading this. :)