Flexible vcat #1659

Merged
merged 82 commits into from
Apr 26, 2019
90d4cf4
initial commit
pdeffebach Dec 20, 2018
ad9b3b7
initial try
Dec 21, 2018
8300c28
Add `keep` option
Dec 21, 2018
1517282
Keep working
Dec 21, 2018
d9070e6
WIP initial commit
Dec 26, 2018
a52f4c7
merge to update
pdeffebach Jan 21, 2019
ae19b61
ordering of header
pdeffebach Jan 21, 2019
0bb967c
more header stuff
pdeffebach Jan 21, 2019
564087c
Work out commented to-dos
pdeffebach Feb 21, 2019
50f5ae1
minor fixes
pdeffebach Feb 23, 2019
907774c
Use iterators.flatten for correct typing
pdeffebach Feb 23, 2019
2d99d94
For switching
pdeffebach Mar 3, 2019
20f9bbd
`columns` as Milan described, plus start of docstr
pdeffebach Mar 3, 2019
791edd5
Rewording and space removal
nalimilan Mar 4, 2019
e6fd17c
Update docs and change implementation
pdeffebach Mar 9, 2019
3c4a272
Add tests
pdeffebach Mar 9, 2019
54998d4
Merge branch 'flexible_vcat' of https://github.com/pdeffebach/DataFra…
pdeffebach Mar 9, 2019
406ba08
Fix merge conflict
pdeffebach Mar 9, 2019
71fd8a4
Final fix
pdeffebach Mar 9, 2019
b7fe9f3
Small changes, ensure copy
nalimilan Mar 11, 2019
af102be
Commit pre-rebase
pdeffebach Apr 4, 2019
b525ec2
no need to rebase. fix tests.
pdeffebach Apr 4, 2019
d81efed
Progress towards empty data frame vcat
pdeffebach Apr 4, 2019
9bde8fb
Change tests to allow vcat(df, DataFrame())
pdeffebach Apr 5, 2019
1ad6e54
From :same to :equal
pdeffebach Apr 5, 2019
44113ae
initial commit
pdeffebach Dec 20, 2018
802180c
initial try
Dec 21, 2018
e9ade20
Add `keep` option
Dec 21, 2018
6790c63
Keep working
Dec 21, 2018
d64898f
WIP initial commit
Dec 26, 2018
dc591b2
ordering of header
pdeffebach Jan 21, 2019
73a676d
more header stuff
pdeffebach Jan 21, 2019
861dd18
Work out commented to-dos
pdeffebach Feb 21, 2019
b1be144
minor fixes
pdeffebach Feb 23, 2019
4f8b662
Use iterators.flatten for correct typing
pdeffebach Feb 23, 2019
506a3a4
For switching
pdeffebach Mar 3, 2019
45d8443
`columns` as Milan described, plus start of docstr
pdeffebach Mar 3, 2019
d2b4e11
Rewording and space removal
nalimilan Mar 4, 2019
4ccdda1
Update docs and change implementation
pdeffebach Mar 9, 2019
14522e1
Add tests
pdeffebach Mar 9, 2019
ed5ed6f
Final fix
pdeffebach Mar 9, 2019
c25ac46
Small changes, ensure copy
nalimilan Mar 11, 2019
2d0d831
Commit pre-rebase
pdeffebach Apr 4, 2019
8bc28b4
no need to rebase. fix tests.
pdeffebach Apr 4, 2019
5a3a6ca
Progress towards empty data frame vcat
pdeffebach Apr 4, 2019
6aa7aa8
Change tests to allow vcat(df, DataFrame())
pdeffebach Apr 5, 2019
a23dc63
From :same to :equal
pdeffebach Apr 5, 2019
01ad9ab
Merge remote-tracking branch 'pdeffebach/flexible_vcat' into flexible…
pdeffebach Apr 5, 2019
451ca2a
more git troubles
pdeffebach Apr 5, 2019
8b0b7c8
Manually add back in a few tests
pdeffebach Apr 5, 2019
6f7d1b5
MOre manual fixes
pdeffebach Apr 5, 2019
0e324bb
even more manual fixes
pdeffebach Apr 5, 2019
e34451d
Rebase fix
nalimilan Apr 5, 2019
6b24f75
final manual fixes
pdeffebach Apr 5, 2019
a42e44a
Merge branch 'flexible_vcat' of https://github.com/pdeffebach/DataFra…
pdeffebach Apr 5, 2019
c8cd4b3
fix tests
pdeffebach Apr 5, 2019
d8324a2
Continue working
pdeffebach Apr 23, 2019
1d6b01e
Fix tests
pdeffebach Apr 23, 2019
6af37c8
reduce diff
pdeffebach Apr 23, 2019
9e8ebe2
final fixes
pdeffebach Apr 23, 2019
bc6220e
Put views tests into testset
pdeffebach Apr 23, 2019
24be56a
Reduce diff
pdeffebach Apr 23, 2019
8ce9839
Merge branch 'master' into flexible_vcat
pdeffebach Apr 23, 2019
bc07df5
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
f2ccec0
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
f0d3a7a
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
6e5fba5
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
3ab3816
Update test/cat.jl
nalimilan Apr 23, 2019
995eb94
Update test/cat.jl
nalimilan Apr 23, 2019
2643142
Update test/cat.jl
nalimilan Apr 23, 2019
187f55d
Update test/cat.jl
nalimilan Apr 23, 2019
3d6c4d1
Update test/cat.jl
nalimilan Apr 23, 2019
12ffd5f
Update test/cat.jl
nalimilan Apr 23, 2019
bcdf155
Respond to milan
pdeffebach Apr 23, 2019
e1c3c9e
Merge remote-tracking branch 'pdeffebach/flexible_vcat' into flexible…
pdeffebach Apr 23, 2019
7459b1d
Respond to milan
pdeffebach Apr 23, 2019
0805b19
Update test/cat.jl
nalimilan Apr 24, 2019
c947b7a
d4 etc
pdeffebach Apr 24, 2019
b1c777d
No more columns, remove a testset
pdeffebach Apr 24, 2019
c26839e
columns -> cols
nalimilan Apr 25, 2019
671bba8
columns -> cols
nalimilan Apr 25, 2019
9661f76
layout fixes and additional tests
bkamins Apr 26, 2019
139 changes: 109 additions & 30 deletions src/abstractdataframe/abstractdataframe.jl
@@ -1017,20 +1017,41 @@ Base.hcat(df1::AbstractDataFrame, df2::AbstractDataFrame, dfn::AbstractDataFrame
makeunique=makeunique, copycols=copycols)

"""
vcat(dfs::AbstractDataFrame...)
vcat(dfs::AbstractDataFrame...; cols::Union{Symbol, AbstractVector{Symbol}}=:equal)

Vertically concatenate `AbstractDataFrames`.
Vertically concatenate `AbstractDataFrame`s.

Column names in all passed data frames must be the same, but they can have
different order. In such cases the order of names in the first passed
`DataFrame` is used.
The `cols` keyword argument determines the columns of the returned data frame:

* `:equal` (the default): require all data frames to have the same column names.
If they appear in different orders, the order of the first provided data frame is used.
* `:intersect`: only the columns present in *all* provided data frames are kept.
If the intersection is empty, an empty data frame is returned.
* `:union`: columns present in *at least one* of the provided data frames are kept.
Columns not present in some data frames are filled with `missing` where necessary.
* A vector of `Symbol`s: only listed columns are kept.
Columns not present in some data frames are filled with `missing` where necessary.

The order of columns is determined by the order they appear in the included
data frames, searching through the header of the first data frame, then the
second, etc.

The element types of columns are determined using `promote_type`,
as with `vcat` for `AbstractVector`s.

`vcat` ignores empty data frames, making it possible to initialize an empty
data frame at the beginning of a loop and `vcat` onto it.

# Example
```jldoctest
julia> df1 = DataFrame(A=1:3, B=1:3);

julia> df2 = DataFrame(A=4:6, B=4:6);

julia> df3 = DataFrame(A=7:9, C=7:9);

julia> d4 = DataFrame();

julia> vcat(df1, df2)
6×2 DataFrame
│ Row │ A │ B │
@@ -1042,47 +1063,105 @@ julia> vcat(df1, df2)
│ 4 │ 4 │ 4 │
│ 5 │ 5 │ 5 │
│ 6 │ 6 │ 6 │

julia> vcat(df1, df3, cols=:union)
6×3 DataFrame
│ Row │ A │ B │ C │
│ │ Int64 │ Int64⍰ │ Int64⍰ │
├─────┼───────┼─────────┼─────────┤
│ 1 │ 1 │ 1 │ missing │
│ 2 │ 2 │ 2 │ missing │
│ 3 │ 3 │ 3 │ missing │
│ 4 │ 7 │ missing │ 7 │
│ 5 │ 8 │ missing │ 8 │
│ 6 │ 9 │ missing │ 9 │

julia> vcat(df1, df3, cols=:intersect)
6×1 DataFrame
│ Row │ A │
│ │ Int64 │
├─────┼───────┤
│ 1 │ 1 │
│ 2 │ 2 │
│ 3 │ 3 │
│ 4 │ 7 │
│ 5 │ 8 │
│ 6 │ 9 │

julia> vcat(d4, df1)
3×2 DataFrame
│ Row │ A │ B │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 1 │
│ 2 │ 2 │ 2 │
│ 3 │ 3 │ 3 │
```

"""
Base.vcat(df::AbstractDataFrame) = DataFrame(df)
Base.vcat(dfs::AbstractDataFrame...) = _vcat(collect(dfs))
function _vcat(dfs::AbstractVector{<:AbstractDataFrame})
Base.vcat(dfs::AbstractDataFrame...;
cols::Union{Symbol, AbstractVector{Symbol}}=:equal) =
_vcat([df for df in collect(dfs) if ncol(df) != 0]; cols=cols)

function _vcat(dfs::AbstractVector{<:AbstractDataFrame};
cols::Union{Symbol, AbstractVector{Symbol}}=:equal)

isempty(dfs) && return DataFrame()
# Array of all headers
allheaders = map(names, dfs)
# Array of unique headers across all data frames
uniqueheaders = unique(allheaders)
# All symbols present across all headers
unionunique = union(uniqueheaders...)
# List of symbols present in all dataframes
intersectunique = intersect(uniqueheaders...)
coldiff = setdiff(unionunique, intersectunique)

if !isempty(coldiff)
# if any DataFrames are a full superset of names, skip them
filter!(u -> Set(u) != Set(unionunique), uniqueheaders)
estrings = Vector{String}(undef, length(uniqueheaders))
for (i, u) in enumerate(uniqueheaders)
matching = findall(h -> u == h, allheaders)
headerdiff = setdiff(coldiff, u)
cols = join(headerdiff, ", ", " and ")
args = join(matching, ", ", " and ")
estrings[i] = "column(s) $cols are missing from argument(s) $args"
end

if cols === :equal
header = unionunique
coldiff = setdiff(unionunique, intersectunique)

if !isempty(coldiff)
# if any DataFrames are a full superset of names, skip them
filter!(u -> !issetequal(u, header), uniqueheaders)
estrings = map(enumerate(uniqueheaders)) do (i, head)
matching = findall(h -> head == h, allheaders)
headerdiff = setdiff(coldiff, head)
cols = join(headerdiff, ", ", " and ")
args = join(matching, ", ", " and ")
return "column(s) $cols are missing from argument(s) $args"
end
throw(ArgumentError(join(estrings, ", ", ", and ")))
end

elseif cols === :intersect
header = intersectunique
elseif cols === :union
header = unionunique
else
header = cols
end

header = allheaders[1]
length(header) == 0 && return DataFrame()
cols = Vector{AbstractVector}(undef, length(header))
all_cols = Vector{AbstractVector}(undef, length(header))
for (i, name) in enumerate(header)
data = [df[name] for df in dfs]
lens = map(length, data)
T = mapreduce(eltype, promote_type, data)
cols[i] = Tables.allocatecolumn(T, sum(lens))
newcols = map(dfs) do df
if haskey(df, name)
return df[name]
else
Iterators.repeated(missing, nrow(df))
end
end

lens = map(length, newcols)
T = mapreduce(eltype, promote_type, newcols)
all_cols[i] = Tables.allocatecolumn(T, sum(lens))
offset = 1
for j in 1:length(data)
copyto!(cols[i], offset, data[j])
for j in 1:length(newcols)
copyto!(all_cols[i], offset, newcols[j])
offset += lens[j]
end
end
return DataFrame(cols, header, copycols=false)
return DataFrame(all_cols, header, copycols=false)
end

function Base.reduce(::typeof(vcat), dfs::AbstractVector{<:AbstractDataFrame})
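
The header selection performed by `_vcat` above can be sketched in plain Julia, without DataFrames. This is only an illustration under stated assumptions: `resolve_header` is a hypothetical helper that is not part of the PR, it works on vectors of column names rather than data frames, its `:equal` error message is simplified, and the explicit rejection of unknown symbols is an addition of this sketch.

```julia
# Hypothetical helper (not part of DataFrames.jl): pick the output header from
# the per-frame column names, following the four `cols` modes in the docstring.
function resolve_header(allheaders::Vector{Vector{Symbol}},
                        cols::Union{Symbol, AbstractVector{Symbol}} = :equal)
    isempty(allheaders) && return Symbol[]
    # Names seen in at least one frame, in first-seen order.
    unionnames = union(allheaders...)
    # Names seen in every frame, in the order of the first frame.
    intersectnames = intersect(allheaders...)
    if cols === :equal
        missingcols = setdiff(unionnames, intersectnames)
        if !isempty(missingcols)
            # Simplified message; the PR also reports which arguments lack them.
            throw(ArgumentError("column(s) " * join(missingcols, ", ", " and ") *
                                " are not present in every argument"))
        end
        return unionnames
    elseif cols === :intersect
        return intersectnames
    elseif cols === :union
        return unionnames
    elseif cols isa Symbol
        # Assumption of this sketch: reject unknown symbols explicitly.
        throw(ArgumentError("unsupported cols value: $cols"))
    else
        # An explicit vector of names is used as given.
        return collect(cols)
    end
end

headers = [[:A, :B], [:A, :C]]
resolve_header(headers, :union)       # [:A, :B, :C]
resolve_header(headers, :intersect)   # [:A]
resolve_header(headers, [:A, :D])     # [:A, :D]
```

Because `union` keeps first-seen order, the result matches the docstring's rule that column order is found by searching the first data frame's header, then the second, and so on.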
92 changes: 66 additions & 26 deletions test/cat.jl
@@ -269,16 +269,19 @@ end

@test vcat(missing_df) == DataFrame()
@test vcat(missing_df, missing_df) == DataFrame()
@test_throws ArgumentError vcat(missing_df, df)
@test_throws ArgumentError vcat(df, missing_df)
@test vcat(missing_df) == DataFrame()
@test vcat(missing_df, missing_df) == DataFrame()
@test vcat(missing_df, df) == df
@test vcat(df, missing_df) == df
@test eltypes(vcat(df, df)) == Type[Float64, Float64, Int]
@test size(vcat(df, df)) == (size(df, 1) * 2, size(df, 2))
res = vcat(df, df)
@test res[1:size(df, 1), :] == df
@test res[1+size(df, 1):end, :] == df
@test eltypes(vcat(df, df, df)) == Type[Float64, Float64, Int]
@test size(vcat(df, df, df)) == (size(df, 1) * 3, size(df, 2))
@test res[(1+size(df, 1)):end, :] == df
res = vcat(df, df, df)
Member

Move this line above to avoid calling vcat(df, df, df) three times.

Contributor Author

Fixed.

@test eltypes(res) == Type[Float64, Float64, Int]
@test size(res) == (size(df, 1) * 3, size(df, 2))

s = size(df, 1)
for i in 1:3
@test res[1+(i-1)*s:i*s, :] == df
@@ -348,24 +351,26 @@ end
@testset "vcat out of order" begin
df1 = DataFrame(A = 1:3, B = 4:6, C = 7:9)
df2 = DataFrame([2x for x in eachcol(df1)], reverse(names(df1)))
@test vcat(df1, df2) == DataFrame([[1, 2, 3, 14, 16, 18],
[4, 5, 6, 8, 10, 12],
[7, 8, 9, 2, 4, 6]], [:A, :B, :C])
@test vcat(df1, df1, df2) == DataFrame([[1, 2, 3, 1, 2, 3, 14, 16, 18],
[4, 5, 6, 4, 5, 6, 8, 10, 12],
[7, 8, 9, 7, 8, 9, 2, 4, 6]], [:A, :B, :C])
@test vcat(df1, df2, df2) == DataFrame([[1, 2, 3, 14, 16, 18, 14, 16, 18],
[4, 5, 6, 8, 10, 12, 8, 10, 12],
[7, 8, 9, 2, 4, 6, 2, 4, 6]], [:A, :B, :C])
@test vcat(df2, df1, df2) == DataFrame([[2, 4, 6, 7, 8, 9, 2, 4, 6],
[8, 10, 12, 4, 5, 6, 8, 10, 12],
[14, 16, 18, 1, 2, 3, 14, 16, 18]] ,[:C, :B, :A])

@test vcat(df1, df2) == DataFrame(A = [1, 2, 3, 14, 16, 18],
B = [4, 5, 6, 8, 10, 12],
C = [7, 8, 9, 2, 4, 6])
# test with cols keyword argument
@test vcat(df1, df2, cols = :equal) == DataFrame(A = [1, 2, 3, 14, 16, 18],
B = [4, 5, 6, 8, 10, 12],
C = [7, 8, 9, 2, 4, 6])
@test vcat(df1, df1, df2) == DataFrame(A = [1, 2, 3, 1, 2, 3, 14, 16, 18],
B = [4, 5, 6, 4, 5, 6, 8, 10, 12],
C = [7, 8, 9, 7, 8, 9, 2, 4, 6])
@test vcat(df1, df2, df2) == DataFrame(A = [1, 2, 3, 14, 16, 18, 14, 16, 18],
B = [4, 5, 6, 8, 10, 12, 8, 10, 12],
C = [7, 8, 9, 2, 4, 6, 2, 4, 6])
@test vcat(df2, df1, df2) == DataFrame(C = [2, 4, 6, 7, 8, 9, 2, 4, 6],
B = [8, 10, 12, 4, 5, 6, 8, 10, 12],
A = [14, 16, 18, 1, 2, 3, 14, 16, 18])
@test size(vcat(df1, df1, df1, df2, df2, df2)) == (18, 3)
df3 = df1[[1, 3, 2]]
res = vcat(df1, df1, df1, df2, df2, df2, df3, df3, df3, df3)
@test res == reduce(vcat, [df1, df1, df1, df2, df2, df2, df3, df3, df3, df3])
Member

Why remove this?

Contributor Author

Not sure why that got left out.


@test size(res) == (30, 3)
@test res[1:3,:] == df1
@test res[4:6,:] == df1
@@ -382,11 +387,46 @@ end
@test [df1; df2] == df3 == reduce(vcat, [df1, df2])
end

@testset "vcat with cols=:union" begin
df1 = DataFrame(A = 1:3, B = 4:6)
df2 = DataFrame(A = 7:9)
df3 = DataFrame(B = 4:6, A = 1:3)

@test vcat(df1, df2; cols = :union) ≅
DataFrame(A = [1, 2, 3, 7, 8, 9],
B = [4, 5, 6, missing, missing, missing])
@test vcat(df1, df2, df3; cols = :union) ≅
DataFrame(A = [1, 2, 3, 7, 8, 9, 1, 2, 3],
B = [4, 5, 6, missing, missing, missing, 4, 5, 6])
end
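
The `missing` filling that this testset exercises can be sketched with Base Julia alone. The sketch assumes a hypothetical `stack_column` helper that is not part of the PR: an absent column is passed as `nothing` together with that frame's row count, and a plain `Vector{T}` stands in for `Tables.allocatecolumn`.

```julia
# Hypothetical stand-in for building one output column. `pieces[k]` is the
# column taken from frame k, or `nothing` if frame k lacks it; `nrows[k]` is
# the number of rows in frame k.
function stack_column(pieces::Vector, nrows::Vector{Int})
    # Absent columns become lazy runs of `missing`, as in the PR.
    parts = map(pieces, nrows) do p, n
        p === nothing ? Iterators.repeated(missing, n) : p
    end
    lens = map(length, parts)
    # The element type is the promotion of all parts; `Missing` widens it
    # to a `Union{Missing, T}`.
    T = mapreduce(eltype, promote_type, parts)
    out = Vector{T}(undef, sum(lens))   # the PR uses Tables.allocatecolumn here
    offset = 1
    for (part, len) in zip(parts, lens)
        copyto!(out, offset, part)      # write each part at its row offset
        offset += len
    end
    return out
end

b = stack_column([4:6, nothing], [3, 3])
@assert eltype(b) == Union{Missing, Int}
@assert isequal(b, [4, 5, 6, missing, missing, missing])
```

The same promotion step explains why mixing an `Int` column with a `Float64` column yields a `Float64` column, consistent with the docstring's statement that element types follow `promote_type`.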

@testset "vcat with cols=:intersect" begin
df1 = DataFrame(A = 1:3, B = 4:6)
df2 = DataFrame(A = 7:9)
df3 = DataFrame(A = 10:12, C = 13:15)

@test vcat(df1, df2; cols = :intersect) ≅ DataFrame(A = [1, 2, 3, 7, 8, 9])
@test vcat(df1, df2, df3; cols = :intersect) ≅ DataFrame(A = [1, 2, 3, 7, 8, 9,
10, 11, 12])
end

@testset "vcat with cols::Vector" begin
df1 = DataFrame(A = 1:3, B = 4:6)
df2 = DataFrame(A = 7:9)
df3 = DataFrame(A = 10:12, C = 13:15)

@test vcat(df1, df2; cols = [:A, :B, :C]) ≅
DataFrame(A = [1, 2, 3, 7, 8, 9],
B = [4, 5, 6, missing, missing, missing],
C = [missing, missing, missing, missing, missing, missing])

@test vcat(df1, df2, df3; cols = [:A, :B, :C]) ≅
DataFrame(A = [1, 2, 3, 7, 8, 9, 10, 11, 12],
B = [4, 5, 6, missing, missing, missing, missing, missing, missing],
C = [missing, missing, missing, missing, missing, missing, 13, 14, 15])
end

@testset "vcat errors" begin
err = @test_throws ArgumentError vcat(DataFrame(), DataFrame(), DataFrame(x=[]))
@test err.value.msg == "column(s) x are missing from argument(s) 1 and 2"
err = @test_throws ArgumentError vcat(DataFrame(), DataFrame(), DataFrame(x=[1]))
@test err.value.msg == "column(s) x are missing from argument(s) 1 and 2"
df1 = DataFrame(A = 1:3, B = 1:3)
df2 = DataFrame(A = 1:3)
# right missing 1 column
@@ -396,10 +436,9 @@ end
err = @test_throws ArgumentError vcat(df2, df1)
@test err.value.msg == "column(s) B are missing from argument(s) 1"
# multiple missing 1 column
err = @test_throws ArgumentError vcat(df1, df2, df2, df2, df2, df2)
err1 = @test_throws ArgumentError vcat(df1, df2, df2, df2, df2, df2)
err2 = @test_throws ArgumentError reduce(vcat, [df1, df2, df2, df2, df2, df2])
Contributor Author

@oxinabox I think my rebase may have collided with your optimization for reduce(vcat, dfs). Should this throw an error or not?

Contributor

If vcat(xs...) throws an error, then reduce(vcat, xs) should throw the same error.
If not, then not.

Contributor Author

Okay then this error should remain deleted. It's good to go.

Member

Doesn't this still throw an error after the PR?

Contributor Author

I missed what this error was about. I have added it back in.

@test err == err2
@test err.value.msg == "column(s) B are missing from argument(s) 2, 3, 4, 5 and 6"
@test err1.value.msg == err2.value.msg == "column(s) B are missing from argument(s) 2, 3, 4, 5 and 6"
# argument missing >1 columns
df1 = DataFrame(A = 1:3, B = 1:3, C = 1:3, D = 1:3, E = 1:3)
err = @test_throws ArgumentError vcat(df1, df2)
@@ -438,6 +477,7 @@ end
err = @test_throws ArgumentError vcat(df1, df2, df3, df4, df1, df2, df3, df4, df1, df2, df3, df4)
@test err.value.msg == "column(s) E and F are missing from argument(s) 1, 5 and 9, column(s) B are missing from argument(s) 2, 6 and 10, and column(s) F are missing from argument(s) 3, 7 and 11"
end

x = view(DataFrame(A = Vector{Union{Missing, Int}}(1:3)), 2:2, :)
y = DataFrame(A = 4:5)
@test vcat(x, y) == DataFrame(A = [2, 4, 5]) == reduce(vcat, [x, y])
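
As a usage illustration of the behaviour tested above, the following sketch assumes the DataFrames package is installed and that it behaves as the docstring in this PR describes. The helper names `collect_chunks` and `argerror_message` are hypothetical and exist only for this example.

```julia
using DataFrames

# Accumulate onto an empty data frame in a loop; per the docstring, `vcat`
# ignores empty data frames, so the first iteration needs no special case.
function collect_chunks(n)
    acc = DataFrame()
    for i in 1:n
        chunk = DataFrame(id = [i, i], value = [10i, 10i + 1])
        acc = vcat(acc, chunk)
    end
    return acc
end

acc = collect_chunks(3)
@assert size(acc) == (6, 2)

# With cols = :union, rows from a frame that lacks a column are filled
# with `missing`.
a = DataFrame(A = 1:3, B = 4:6)
b = DataFrame(A = 7:9)
u = vcat(a, b; cols = :union)
@assert size(u) == (6, 2)
@assert all(ismissing, u[4:6, :B])

# Return the message of the ArgumentError thrown by `f`, or `nothing`.
function argerror_message(f)
    try
        f()
    catch e
        e isa ArgumentError || rethrow()
        return e.msg
    end
    return nothing
end

# As agreed in the review discussion, `vcat(xs...)` and `reduce(vcat, xs)`
# are expected to fail in the same way under the default `cols = :equal`.
m1 = argerror_message(() -> vcat(a, b))
m2 = argerror_message(() -> reduce(vcat, [a, b]))
@assert m1 !== nothing && m1 == m2
```

The exact wording of the error message is not asserted here, only that both call forms agree, since the text may differ between DataFrames versions.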