Add transformation and renaming to select and select! #2080

Merged 51 commits on Mar 19, 2020

Commits
77f4623
add support for transforms in select and define transform and transform!
bkamins Jan 6, 2020
147a427
fix SubDataFrame select signature
bkamins Jan 6, 2020
11fd0a2
fix problem in autogeneration of column names
bkamins Jan 6, 2020
6fa4f84
add documentation of automatic generation of column names
bkamins Jan 7, 2020
d501fb4
improvements after code review
bkamins Jan 8, 2020
7053e5b
updates after a code review
bkamins Jan 9, 2020
ec834e2
correct variable name
bkamins Jan 10, 2020
6c76aca
minor fix
bkamins Jan 10, 2020
e59d129
fix select for SubDataFrame
bkamins Jan 10, 2020
dee8ac7
improved multiple column transformation
bkamins Jan 10, 2020
4a8a40b
improve select for SubDataFrame
bkamins Jan 10, 2020
f04a549
Apply suggestions from code review
bkamins Jan 10, 2020
bbc06f4
fixes after code review
bkamins Jan 10, 2020
498d9df
fixes from code review
bkamins Jan 12, 2020
fa5a1f1
disallow duplicates in single column selection
bkamins Jan 15, 2020
cd8f41b
fix select for SubDataFrame to avoid duplicate ColumnIndex selelctions
bkamins Jan 15, 2020
3c7149b
Apply suggestions from code review
bkamins Jan 15, 2020
807adfc
fixes after the code review
bkamins Jan 16, 2020
aa7746b
change default behavior to whole-column and add Row
bkamins Feb 1, 2020
7524706
fix typo
bkamins Feb 4, 2020
3d77f6b
add funname to Row
bkamins Feb 5, 2020
e560a14
merge normalize_selection methods
bkamins Feb 5, 2020
9caab2d
make ByRow a functor
bkamins Feb 11, 2020
db8f103
Update src/abstractdataframe/selection.jl
bkamins Feb 14, 2020
df6795a
disallow transofmation of 0 columns
bkamins Feb 14, 2020
ba1feb9
disallow 0 columns only in ByRow
bkamins Feb 15, 2020
0c30db7
Merge branch 'master' into flexible_select
bkamins Feb 15, 2020
6d03a1c
sync with Tables 1.0
bkamins Feb 15, 2020
34aa4cd
fix documentation
bkamins Feb 15, 2020
a03afd7
fix missing parenthesis
bkamins Feb 16, 2020
d4fced0
fix method signature
bkamins Feb 17, 2020
c712088
export ByRow
bkamins Feb 17, 2020
9b5c027
auto-splat (no docs update)
bkamins Feb 22, 2020
8e73abc
fix @views
bkamins Feb 22, 2020
930875e
move to broadcasting in ByRow
bkamins Feb 26, 2020
ab4103a
Apply suggestions from code review
bkamins Feb 28, 2020
4289c48
update implementation
bkamins Feb 28, 2020
6341ccc
reorganize tests
bkamins Feb 28, 2020
09e632e
first round of tests
bkamins Feb 28, 2020
df59216
disallow AbstractDataFrame, NamedTuple, DataFrameRow, and AbstractMat…
bkamins Feb 29, 2020
d932b05
fix test
bkamins Feb 29, 2020
688b077
clean up transformation implementation
bkamins Mar 1, 2020
c34ee72
further sanitizing select rules and more code explanations
bkamins Mar 1, 2020
b818d57
fix comments
bkamins Mar 1, 2020
08d4043
tests of disallowed values
bkamins Mar 2, 2020
49dff0e
finalize tests
bkamins Mar 2, 2020
d685576
fix Julia 1.0 tests
bkamins Mar 2, 2020
35f8996
stop doing pessimistic copy when copycols=true
bkamins Mar 12, 2020
78b492d
Apply suggestions from code review
bkamins Mar 18, 2020
52e690d
fixes after code review
bkamins Mar 18, 2020
20642c5
improve docstring
bkamins Mar 19, 2020
2 changes: 1 addition & 1 deletion docs/src/lib/types.md
@@ -114,6 +114,7 @@ without caution because:

```@docs
AbstractDataFrame
ByRow
DataFrame
DataFrameRow
GroupedDataFrame
@@ -124,5 +125,4 @@ DataFrameRows
DataFrameColumns
RepeatedVector
StackedVector
Row
```
28 changes: 26 additions & 2 deletions docs/src/man/getting_started.md
@@ -522,11 +522,14 @@ julia> df[in.(df.A, Ref([1, 5, 601])), :]
│ 3 │ 601 │ 7 │ 301 │
```

Equivalently, the `in` function can be called with a single argument to create a function object that tests whether each value belongs to the subset (partial application of `in`): `df[in([1, 5, 601]).(df.A), :]`.
Equivalently, the `in` function can be called with a single argument to create
a function object that tests whether each value belongs to the subset
(partial application of `in`): `df[in([1, 5, 601]).(df.A), :]`.

#### Column selection using `select` and `select!`

You can also use the [`select`](@ref) and [`select!`](@ref) functions to select columns in a data frame.
You can also use the [`select`](@ref) and [`select!`](@ref) functions to select,
rename and transform columns in a data frame.

The `select` function creates a new data frame:
```jldoctest dataframe
@@ -550,6 +553,27 @@ julia> select(df, r"x") # select columns containing 'x' character
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │

julia> select(df, :x1 => :a1, :x2 => :a2) # rename columns
1×2 DataFrame
│ Row │ a1 │ a2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 2 │

julia> select(df, :x1, :x2 => (x -> 2x) => :x2) # transform columns
1×2 DataFrame
│ Row │ x1 │ x2 │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 4 │

julia> select(df, :x1, :x2 => ByRow(UInt8) => :x2) # transform columns by row
1×2 DataFrame
│ Row │ x1 │ x2 │
│ │ Int64 │ UInt8 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 0x02 │
```

It is important to note that `select` always returns a data frame,
81 changes: 41 additions & 40 deletions src/abstractdataframe/selection.jl
@@ -1,9 +1,8 @@
# TODO:
# * add transform and transfom! functions
# * update documentation
# * add tests
# * add NT (or better name) to column selector passing NamedTuple
# (also in other places: filter, combine)
# * add select/select!/transform/transform! for GroupedDataFrame

# normalize_selection function makes sure that whatever input format of idx is it
# will end up in one of four canonical forms
@@ -134,32 +133,32 @@ In particular, regular expressions, `All`, `Between`, and `Not` selectors are su

Columns can be renamed using the `old_column => new_column_name` syntax,
and transformed using the `old_column => fun => new_column_name` syntax.
`new_column_name` must be a `Symbol`, and `fun` a function or a type.
If `old_column` is a `Symbol` or an integer then `fun` is applied to the corresponding column vector.
`new_column_name` must be a `Symbol`, and `fun` a function or a type. If `old_column`
is a `Symbol` or an integer then `fun` is applied to the corresponding column vector.
Otherwise `old_column` can be any column indexing syntax, in which case `fun`
will be passed the column vectors specified by `old_column` as separate arguments.

To apply `fun` to each row instead of whole columns, it can be wrapped in a `ByRow` struct. In this case
if `old_column` is a `Symbol` or an integer then `fun` is applied to each element
(row) of `old_column`. Otherwise `old_column` can be any column indexing syntax,
in which case `fun` will be passed one argument for each of the columns specified by `old_column`.
If `ByRow` is used it is not allowed
that `old_column` selects an empty set of columns.
To apply `fun` to each row instead of whole columns, it can be wrapped in a `ByRow`
struct. In this case if `old_column` is a `Symbol` or an integer then `fun` is applied
to each element (row) of `old_column`. Otherwise `old_column` can be any column
indexing syntax, in which case `fun` will be passed one argument for each of the
columns specified by `old_column`. If `ByRow` is used it is not allowed that
`old_column` selects an empty set of columns.

Column transformation can also be specified using the short `old_column => fun` form.
In this case, `new_column_name` is automatically generated as `\$(old_column)_\$(fun)`.
Up to three column names are used for multiple input columns and they are joined
using `_`; if more than three columns are passed then the name consists of the
first two names and `etc` suffix then, e.g. `[:a,:b,:c,:d] => fun` produces
the new column name `a_b_etc_fun`.
the new column name `:a_b_etc_fun`.

If a collection of column names is passed to `select!` then requesting duplicate column
names in target data frame are accepted (e.g. `select!(df, [:a], :, r"a")` is allowed)
and only the first occurrence is used. In particular a syntax to move column `:col`
to the first position in the data frame is `select!(df, :col, :)`.
On the contrary, output column names of renaming, transformation and single column
selection operations must be unique, so e.g. `select!(df, :a, :a => :a)` or
`select!(df, :a, :a => sin => :a)` are not allowed.
`select!(df, :a, :a => ByRow(sin) => :a)` are not allowed.

Note that including the same column several times in the data frame via renaming
when `copycols=false` will create column aliases. An example of such a situation is
@@ -260,8 +259,7 @@ end
"""
select(df::AbstractDataFrame, inds...; copycols::Bool=true)

Create a new data frame that contains columns from `df`
specified by `inds` and return it.
Create a new data frame that contains columns from `df` specified by `inds` and return it.

Arguments passed as `inds...` can be any index that is allowed for column indexing.
In particular, regular expressions, `All`, `Between`, and `Not` selectors are supported.
@@ -271,36 +269,36 @@ are supported.

Columns can be renamed using the `old_column => new_column_name` syntax,
and transformed using the `old_column => fun => new_column_name` syntax.
`new_column_name` must be a `Symbol`, and `fun` a function or a type.
If `old_column` is a `Symbol` or an integer then `fun` is applied to a column `old_column`.
Otherwise `old_column` can be any column indexing syntax, but in this case `fun`
will be passed a `NamedTuple` holding only the columns specified by `old_column`.

It is allowed to wrap `fun` in `ByRow` struct. In this case
if `old_column` is a `Symbol` or an integer then `fun` is applied to each element
(row) of `old_column`. Otherwise `old_column` can be any column indexing syntax,
but in this case `fun` will be passed a `NamedTuple` representing each row, holding only
the columns specified by `old_column`. If `ByRow` is used it is not allowed
that `old_column` selects an empty set of columns.
`new_column_name` must be a `Symbol`, and `fun` a function or a type. If `old_column`
is a `Symbol` or an integer then `fun` is applied to the corresponding column vector.
Otherwise `old_column` can be any column indexing syntax, in which case `fun`
will be passed the column vectors specified by `old_column` as separate arguments.

To apply `fun` to each row instead of whole columns, it can be wrapped in a `ByRow`
struct. In this case if `old_column` is a `Symbol` or an integer then `fun` is applied
to each element (row) of `old_column`. Otherwise `old_column` can be any column
indexing syntax, in which case `fun` will be passed one argument for each of the
columns specified by `old_column`. If `ByRow` is used it is not allowed that
`old_column` selects an empty set of columns.

Column transformation can also be specified using the short `old_column => fun` form.
In this case, `new_column_name` is automatically generated as `\$(old_column)_\$(fun)`.
Up to three column names are used for multiple input columns and they are joined
using `_`; if more than three columns are passed then the name consists of the
first two names and `etc` suffix then, e.g. `[:a,:b,:c,:d] => fun` produces
the new column name `a_b_etc_fun`.
the new column name `:a_b_etc_fun`.

If a collection of column names is passed to `select` then requesting duplicate column
names in target data frame are accepted (e.g. `select(df, [:a], :, r"a")` is allowed)
If a collection of column names is passed to `select!` then requesting duplicate column
names in target data frame are accepted (e.g. `select!(df, [:a], :, r"a")` is allowed)
and only the first occurrence is used. In particular a syntax to move column `:col`
to the first position in the data frame is `select(df, :col, :)`.
to the first position in the data frame is `select!(df, :col, :)`.
On the contrary, output column names of renaming, transformation and single column
selection operations must be unique, so e.g. `select(df, :a, :a => :a)` or
`select(df, :a, :a => sin => :a)` are not allowed.
selection operations must be unique, so e.g. `select!(df, :a, :a => :a)` or
`select!(df, :a, :a => ByRow(sin) => :a)` are not allowed.

If `df` is a `DataFrame` a new `DataFrame` is returned.
If `copycols=true` (the default), then returned `DataFrame` is guaranteed not to share columns with `df`.
If `copycols=false`, then returned `DataFrame` shares column vectors with `df` where possible.
If `df` is a `DataFrame` a new `DataFrame` is returned. If `copycols=true` (the default),
then returned `DataFrame` is guaranteed not to share columns with `df`. If
`copycols=false`, then returned `DataFrame` shares column vectors with `df` where possible.

If `df` is a `SubDataFrame` then a `SubDataFrame` is returned if `copycols=false`
and a `DataFrame` with freshly allocated columns otherwise.

Member:
"freshly allocated" only according to the rules described above for DataFrame right?

Member Author:
yes. but in particular this time we are sure we will not reuse the columns from df as SubDataFrame holds views, and we always materialize views when copycols=true.

Member:
But what happens if you do e.g. :x => (x -> v) with v a global vector?

Member Author:
it does not get copied - exactly like in DataFrame case:

```julia
julia> using DataFrames

julia> df = view(DataFrame(rand(2,3)), :, :)
2×3 SubDataFrame
│ Row │ x1       │ x2       │ x3       │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.86855  │ 0.994078 │ 0.512417 │
│ 2   │ 0.473683 │ 0.911317 │ 0.284993 │

julia> x = [1, 2]
2-element Array{Int64,1}:
 1
 2

julia> df2 = select(df, :x1 => (y -> x) => :y)
2×1 DataFrame
│ Row │ y     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df2.y === x
true
```

Member Author:
so I understand it is best to remove "freshly allocated" right?

@@ -385,11 +383,14 @@ function _select(df::AbstractDataFrame, normalized_cs, copycols::Bool)
# the role of transformed_cols is the following
# * make sure that we do not use the same target column name twice in transformations;
# note though that it can appear in no-transformation selection like
# `select(df, :, :a => sin => :a), where :a is produced both by `:` and by `:a => sin => :a`
# * make sure that if some column is produced by transformation like `:a => sin => :a`
# and it appears earlier or later in non-transforming selection like `:` or `:a`
# then the transformation is computed and inserted in to the target data frame once and only once
# the first time the target column is requested to be produced.
# `select(df, :, :a => ByRow(sin) => :a), where :a is produced both by `:`
# and by `:a => ByRow(sin) => :a`
# * make sure that if some column is produced by transformation like
# `:a => ByRow(sin) => :a` and it appears earlier or later in non-transforming
# selection like `:` or `:a` then the transformation is computed and inserted
# in to the target data frame once and only once the first time the target column
# is requested to be produced.
#
# For example in:
#
# julia> df = DataFrame(a=1:2, b=3:4)
@@ -400,7 +401,7 @@ function _select(df::AbstractDataFrame, normalized_cs, copycols::Bool)
# │ 1 │ 1 │ 3 │
# │ 2 │ 2 │ 4 │
#
# julia> select(df, :, :a=>ByRow(sin)=>:a, :a, 1)
# julia> select(df, :, :a => ByRow(sin) => :a, :a, 1)
# 2×2 DataFrame
# │ Row │ a │ b │
# │ │ Float64 │ Int64 │