Add transformation and renaming to select and select! #2080

Merged 51 commits on Mar 19, 2020

Commits
77f4623
add support for transforms in select and define transform and transform!
bkamins Jan 6, 2020
147a427
fix SubDataFrame select signature
bkamins Jan 6, 2020
11fd0a2
fix problem in autogeneration of column names
bkamins Jan 6, 2020
6fa4f84
add documentation of automatic generation of column names
bkamins Jan 7, 2020
d501fb4
improvements after code review
bkamins Jan 8, 2020
7053e5b
updates after a code review
bkamins Jan 9, 2020
ec834e2
correct variable name
bkamins Jan 10, 2020
6c76aca
minor fix
bkamins Jan 10, 2020
e59d129
fix select for SubDataFrame
bkamins Jan 10, 2020
dee8ac7
improved multiple column transformation
bkamins Jan 10, 2020
4a8a40b
improve select for SubDataFrame
bkamins Jan 10, 2020
f04a549
Apply suggestions from code review
bkamins Jan 10, 2020
bbc06f4
fixes after code review
bkamins Jan 10, 2020
498d9df
fixes from code review
bkamins Jan 12, 2020
fa5a1f1
disallow duplicates in single column selection
bkamins Jan 15, 2020
cd8f41b
fix select for SubDataFrame to avoid duplicate ColumnIndex selections
bkamins Jan 15, 2020
3c7149b
Apply suggestions from code review
bkamins Jan 15, 2020
807adfc
fixes after the code review
bkamins Jan 16, 2020
aa7746b
change default behavior to whole-column and add Row
bkamins Feb 1, 2020
7524706
fix typo
bkamins Feb 4, 2020
3d77f6b
add funname to Row
bkamins Feb 5, 2020
e560a14
merge normalize_selection methods
bkamins Feb 5, 2020
9caab2d
make ByRow a functor
bkamins Feb 11, 2020
db8f103
Update src/abstractdataframe/selection.jl
bkamins Feb 14, 2020
df6795a
disallow transformation of 0 columns
bkamins Feb 14, 2020
ba1feb9
disallow 0 columns only in ByRow
bkamins Feb 15, 2020
0c30db7
Merge branch 'master' into flexible_select
bkamins Feb 15, 2020
6d03a1c
sync with Tables 1.0
bkamins Feb 15, 2020
34aa4cd
fix documentation
bkamins Feb 15, 2020
a03afd7
fix missing parenthesis
bkamins Feb 16, 2020
d4fced0
fix method signature
bkamins Feb 17, 2020
c712088
export ByRow
bkamins Feb 17, 2020
9b5c027
auto-splat (no docs update)
bkamins Feb 22, 2020
8e73abc
fix @views
bkamins Feb 22, 2020
930875e
move to broadcasting in ByRow
bkamins Feb 26, 2020
ab4103a
Apply suggestions from code review
bkamins Feb 28, 2020
4289c48
update implementation
bkamins Feb 28, 2020
6341ccc
reorganize tests
bkamins Feb 28, 2020
09e632e
first round of tests
bkamins Feb 28, 2020
df59216
disallow AbstractDataFrame, NamedTuple, DataFrameRow, and AbstractMat…
bkamins Feb 29, 2020
d932b05
fix test
bkamins Feb 29, 2020
688b077
clean up transformation implementation
bkamins Mar 1, 2020
c34ee72
further sanitizing select rules and more code explanations
bkamins Mar 1, 2020
b818d57
fix comments
bkamins Mar 1, 2020
08d4043
tests of disallowed values
bkamins Mar 2, 2020
49dff0e
finalize tests
bkamins Mar 2, 2020
d685576
fix Julia 1.0 tests
bkamins Mar 2, 2020
35f8996
stop doing pessimistic copy when copycols=true
bkamins Mar 12, 2020
78b492d
Apply suggestions from code review
bkamins Mar 18, 2020
52e690d
fixes after code review
bkamins Mar 18, 2020
20642c5
improve docstring
bkamins Mar 19, 2020
38 changes: 28 additions & 10 deletions src/abstractdataframe/selection.jl
@@ -139,18 +139,19 @@ function select_transform!(nc:: Pair{<:Union{Int, AbstractVector{Int}},
                            transformed_cols::Dict{Symbol, Any}, copycols::Bool)
     col_idx, (fun, newname) = nc
     @assert !hasproperty(newdf, newname)
+    cdf = eachcol(df)
     if col_idx isa Int
         res = fun(df[!, col_idx])
     else
-        cdf = eachcol(df)
         res = fun((cdf[i] for i in col_idx)...)
     end
     if res isa Union{AbstractDataFrame, NamedTuple, DataFrameRow, AbstractMatrix}
         throw(ArgumentError("return value from function $fun " *
                             "of type $(typeof(res)) is currently not allowed."))
     end
     if res isa AbstractVector
-        if copycols && !(first(fun isa ByRow))
+        if copycols && !(fun isa ByRow) && (res isa SubArray ||
+            any(i -> parent(res) === parent(cdf[i]), col_idx))
             newdf[!, newname] = copy(res)
         else
             newdf[!, newname] = res
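
The copy condition added in this hunk can be read as a small predicate. The sketch below restates it in plain Julia with no DataFrames dependency. `needs_copy` is a hypothetical helper written only for illustration; it is not a function in DataFrames.jl, and it leaves out the `copycols` and `ByRow` parts of the real condition.

```julia
# Hypothetical helper mirroring the aliasing check added in the diff;
# it is not part of DataFrames.jl.
needs_copy(res::AbstractVector, cols) =
    res isa SubArray || any(col -> parent(res) === parent(col), cols)

a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]   # a vector that is not one of the source columns

needs_copy(a, (a, b))             # true: the result is one of the source columns
needs_copy(view(a, 1:2), (a, b))  # true: a SubArray is always copied
needs_copy(a .+ 1, (a, b))        # false: freshly allocated by the function
needs_copy(c, (a, b))             # false: outside vector, stored without a copy
```

The last case is the one the docstring change in this PR warns about: a vector that is neither derived from a source column nor a `SubArray` passes through uncopied.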
@@ -203,8 +204,8 @@ selection operations must be unique, so e.g. `select!(df, :a, :a => :a)` or
 `select!(df, :a, :a => ByRow(sin) => :a)` are not allowed.
 
 Note that including the same column several times in the data frame via renaming
-when `copycols=false` will create column aliases. An example of such a situation is
-`select!(df, :a, :a => :b, :a => :c, copycols=false)`.
+or transformations that do not allocate will create column aliases.
+An example of such a situation is `select!(df, :a, :a => :b, :a => identity => :c)`.
 
 # Examples
 ```jldoctest
@@ -342,16 +343,33 @@ On the contrary, output column names of renaming, transformation and single colu
 selection operations must be unique, so e.g. `select!(df, :a, :a => :a)` or
 `select!(df, :a, :a => ByRow(sin) => :a)` are not allowed.
 
-If `df` is a `DataFrame` a new `DataFrame` is returned. If `copycols=true` (the default),
-then returned `DataFrame` is guaranteed not to share columns with `df`. If
-`copycols=false`, then returned `DataFrame` shares column vectors with `df` where possible.
+If `df` is a `DataFrame` a new `DataFrame` is returned.
+If `copycols=false`, then returned `DataFrame` shares column vectors with `df` where possible.
+If `copycols=true` (the default), then returned `DataFrame` will not share columns with `df`.
+The only exception for this rule is `old_column => fun => new_column_name` transformation
+when `fun` returns a vector that is not allocated by `fun` but at the same time it is neither
+a vector derived from a vector passed in `old_column` nor it is a `SubArray`.
+In such a case a new `DataFrame` might contain aliases. Such a situation might happen eg.
+in the following code
+```jldoctest
+julia> df = DataFrame(a=1:3, b=4:6);
+
+julia> c = [7, 8, 9];
+
+julia> df2 = select(df, :a => (x -> c) => :c1, :b => (x -> c) => :c2);
+```
+Now `df2` contains columns `:c1` and `:c2` that are aliases although we have used
+`copycols=true` in `select` (which is a default).
+Although this is allowed, such style of usage of the `select` function is discouraged,
+normally `fun` in `old_column => fun => new_column_name` should allocate a fresh vector.
 
 If `df` is a `SubDataFrame` then a `SubDataFrame` is returned if `copycols=false`
 and a `DataFrame` with freshly allocated columns otherwise.
Member

"Freshly allocated" only according to the rules described above for `DataFrame`, right?

Member Author

Yes. But in this case in particular we are sure we will not reuse the columns from `df`, as a `SubDataFrame` holds views and we always materialize views when `copycols=true`.

Member

But what happens if you do e.g. `:x => (x -> v)` with `v` a global vector?

Member Author

It does not get copied, exactly as in the `DataFrame` case:

julia> using DataFrames

julia> df = view(DataFrame(rand(2,3)), :, :)
2×3 SubDataFrame
│ Row │ x1       │ x2       │ x3       │
│     │ Float64  │ Float64  │ Float64  │
├─────┼──────────┼──────────┼──────────┤
│ 1   │ 0.86855  │ 0.994078 │ 0.512417 │
│ 2   │ 0.473683 │ 0.911317 │ 0.284993 │

julia> x = [1, 2]
2-element Array{Int64,1}:
 1
 2

julia> df2 = select(df, :x1 => (y -> x) => :y)
2×1 DataFrame
│ Row │ y     │
│     │ Int64 │
├─────┼───────┤
│ 1   │ 1     │
│ 2   │ 2     │

julia> df2.y === x
true

Member Author

So I understand it is best to remove "freshly allocated", right?
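
To illustrate the recommendation discussed in this thread, here is a minimal sketch, assuming a DataFrames.jl version that includes this PR, of a transformation that copies inside `fun` so that no output column aliases an outside vector. The names `shared` and `df2` are made up for this example.

```julia
using DataFrames

df = DataFrame(a=1:3, b=4:6)
shared = [7, 8, 9]  # hypothetical vector living outside the data frame

# Copying inside the transformation gives every output column a fresh vector.
df2 = select(df, :a => (x -> copy(shared)) => :c1,
                 :b => (x -> copy(shared)) => :c2)

df2.c1 === shared   # false: c1 is a copy
df2.c1 === df2.c2   # false: two independent copies
```

Mutating `shared` afterwards then leaves `df2` unchanged, which is usually what a caller relying on `copycols=true` expects.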


-Note that including the same column several times in the data frame via renaming
-when `copycols=false` will create column aliases. An example of such a situation is
-`select(df, :a, :a => :b, :a => :c, copycols=false)`.
+Note that including the same column several times in the data frame via renaming or
+transformations that do not allocate when `copycols=false` will create column aliases.
+An example of such a situation is
+`select(df, :a, :a => :b, :a => identity => :c, copycols=false)`.
 
 # Examples
 ```jldoctest
24 changes: 23 additions & 1 deletion test/select.jl
@@ -733,7 +733,7 @@ end
 
     @test select(df, r"z") == DataFrame()
     @test select(df, r"z" => () -> x) == DataFrame(_function=x)
-    @test select(df, r"z" => () -> x)[!, 1] !== x
+    @test select(df, r"z" => () -> x)[!, 1] === x # no copy even for copycols=true
     @test_throws MethodError select(df, r"z" => x -> 1)
     @test_throws ArgumentError select(df, r"z" => ByRow(rand))
 
@@ -813,4 +813,26 @@ end
     @test_throws ArgumentError select(sdf, :x1 => identity => :r1, copycols=false)
 end
 
+@testset "copycols special cases" begin
+    df = DataFrame(a=1:3, b=4:6)
+    c = [7, 8]
+    df2 = select(df, :a => (x -> c) => :c1, :b => (x -> c) => :c2)
+    @test df2.c1 === df2.c2
+    df2 = select(df, :a => identity => :c1, :a => :c2)
+    @test df2.c1 !== df2.c2
+    df2 = select(df, :a => identity => :c1)
+    @test df2.c1 !== df.a
+    df2 = select(df, :a => (x -> df.b) => :c1)
+    @test df2.c1 === df.b
+    df2 = select(view(df, 1:2, :), :a => parent => :c1)
+    @test df2.c1 !== df.a
+    df2 = select(view(df, 1:2, :), :a => (x -> view(x, 1:1)) => :c1)
+    @test df2.c1 isa Vector
+    df2 = select(df, :a, :a => :b, :a => identity => :c, copycols=false)
+    @test df2.b === df2.c == df.a
+    a = df.a
+    select!(df, :a, :a => :b, :a => identity => :c)
+    @test df.b === df.c == a
+end
+
 end # module