Flexible vcat #1659

Merged
merged 82 commits into from
Apr 26, 2019
Changes from 8 commits
90d4cf4
initial commit
pdeffebach Dec 20, 2018
ad9b3b7
initial try
Dec 21, 2018
8300c28
Add `keep` option
Dec 21, 2018
1517282
Keep working
Dec 21, 2018
d9070e6
WIP initial commit
Dec 26, 2018
a52f4c7
merge to update
pdeffebach Jan 21, 2019
ae19b61
ordering of header
pdeffebach Jan 21, 2019
0bb967c
more header stuff
pdeffebach Jan 21, 2019
564087c
Work out commented to-dos
pdeffebach Feb 21, 2019
50f5ae1
minor fixes
pdeffebach Feb 23, 2019
907774c
Use iterators.flatten for correct typing
pdeffebach Feb 23, 2019
2d99d94
For switching
pdeffebach Mar 3, 2019
20f9bbd
`columns` as Milan described, plus start of docstr
pdeffebach Mar 3, 2019
791edd5
Rewording and space removal
nalimilan Mar 4, 2019
e6fd17c
Update docs and change implementation
pdeffebach Mar 9, 2019
3c4a272
Add tests
pdeffebach Mar 9, 2019
54998d4
Merge branch 'flexible_vcat' of https://github.com/pdeffebach/DataFra…
pdeffebach Mar 9, 2019
406ba08
Fix merge conflict
pdeffebach Mar 9, 2019
71fd8a4
Final fix
pdeffebach Mar 9, 2019
b7fe9f3
Small changes, ensure copy
nalimilan Mar 11, 2019
af102be
Commit pre-rebase
pdeffebach Apr 4, 2019
b525ec2
no need to rebase. fix tests.
pdeffebach Apr 4, 2019
d81efed
Progress towards empty data frame vcat
pdeffebach Apr 4, 2019
9bde8fb
Change tests to allow vcat(df, DataFrame())
pdeffebach Apr 5, 2019
1ad6e54
From :same to :equal
pdeffebach Apr 5, 2019
44113ae
initial commit
pdeffebach Dec 20, 2018
802180c
initial try
Dec 21, 2018
e9ade20
Add `keep` option
Dec 21, 2018
6790c63
Keep working
Dec 21, 2018
d64898f
WIP initial commit
Dec 26, 2018
dc591b2
ordering of header
pdeffebach Jan 21, 2019
73a676d
more header stuff
pdeffebach Jan 21, 2019
861dd18
Work out commented to-dos
pdeffebach Feb 21, 2019
b1be144
minor fixes
pdeffebach Feb 23, 2019
4f8b662
Use iterators.flatten for correct typing
pdeffebach Feb 23, 2019
506a3a4
For switching
pdeffebach Mar 3, 2019
45d8443
`columns` as Milan described, plus start of docstr
pdeffebach Mar 3, 2019
d2b4e11
Rewording and space removal
nalimilan Mar 4, 2019
4ccdda1
Update docs and change implementation
pdeffebach Mar 9, 2019
14522e1
Add tests
pdeffebach Mar 9, 2019
ed5ed6f
Final fix
pdeffebach Mar 9, 2019
c25ac46
Small changes, ensure copy
nalimilan Mar 11, 2019
2d0d831
Commit pre-rebase
pdeffebach Apr 4, 2019
8bc28b4
no need to rebase. fix tests.
pdeffebach Apr 4, 2019
5a3a6ca
Progress towards empty data frame vcat
pdeffebach Apr 4, 2019
6aa7aa8
Change tests to allow vcat(df, DataFrame())
pdeffebach Apr 5, 2019
a23dc63
From :same to :equal
pdeffebach Apr 5, 2019
01ad9ab
Merge remote-tracking branch 'pdeffebach/flexible_vcat' into flexible…
pdeffebach Apr 5, 2019
451ca2a
more git troubles
pdeffebach Apr 5, 2019
8b0b7c8
Manually add back in a few tests
pdeffebach Apr 5, 2019
6f7d1b5
MOre manual fixes
pdeffebach Apr 5, 2019
0e324bb
even more manual fixes
pdeffebach Apr 5, 2019
e34451d
Rebase fix
nalimilan Apr 5, 2019
6b24f75
final manual fixes
pdeffebach Apr 5, 2019
a42e44a
Merge branch 'flexible_vcat' of https://github.com/pdeffebach/DataFra…
pdeffebach Apr 5, 2019
c8cd4b3
fix tests
pdeffebach Apr 5, 2019
d8324a2
Continue working
pdeffebach Apr 23, 2019
1d6b01e
Fix tests
pdeffebach Apr 23, 2019
6af37c8
reduce diff
pdeffebach Apr 23, 2019
9e8ebe2
final fixes
pdeffebach Apr 23, 2019
bc6220e
Put views tests into testset
pdeffebach Apr 23, 2019
24be56a
Reduce diff
pdeffebach Apr 23, 2019
8ce9839
Merge branch 'master' into flexible_vcat
pdeffebach Apr 23, 2019
bc07df5
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
f2ccec0
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
f0d3a7a
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
6e5fba5
Update src/abstractdataframe/abstractdataframe.jl
nalimilan Apr 23, 2019
3ab3816
Update test/cat.jl
nalimilan Apr 23, 2019
995eb94
Update test/cat.jl
nalimilan Apr 23, 2019
2643142
Update test/cat.jl
nalimilan Apr 23, 2019
187f55d
Update test/cat.jl
nalimilan Apr 23, 2019
3d6c4d1
Update test/cat.jl
nalimilan Apr 23, 2019
12ffd5f
Update test/cat.jl
nalimilan Apr 23, 2019
bcdf155
Respond to milan
pdeffebach Apr 23, 2019
e1c3c9e
Merge remote-tracking branch 'pdeffebach/flexible_vcat' into flexible…
pdeffebach Apr 23, 2019
7459b1d
Respond to milan
pdeffebach Apr 23, 2019
0805b19
Update test/cat.jl
nalimilan Apr 24, 2019
c947b7a
d4 etc
pdeffebach Apr 24, 2019
b1c777d
No more columns, remove a testset
pdeffebach Apr 24, 2019
c26839e
columns -> cols
nalimilan Apr 25, 2019
671bba8
columns -> cols
nalimilan Apr 25, 2019
9661f76
layout fixes and additional tests
bkamins Apr 26, 2019
85 changes: 69 additions & 16 deletions src/abstractdataframe/abstractdataframe.jl
@@ -983,35 +983,88 @@ julia> vcat(df1, df2)
│ 6 │ 6 │ 6 │
```
"""
Base.vcat(df::AbstractDataFrame) = df
nalimilan marked this conversation as resolved.
Base.vcat(dfs::AbstractDataFrame...) = _vcat(collect(dfs))
function _vcat(dfs::AbstractVector{<:AbstractDataFrame})
Base.vcat(df::AbstractDataFrame;
widen::Bool = false,
Member

Fix indentation.

Contributor Author

No longer needed.

keep::Union{Nothing, Vector{Symbol}} = nothing) = df

Base.vcat(dfs::AbstractDataFrame...;
widen::Bool = false,
missing = missing,
Member

This argument isn't used.

keep::Union{Nothing, Vector{Symbol}} = nothing) =
_vcat(collect(dfs); widen = widen, keep = keep)

function _vcat(dfs::AbstractVector{<:AbstractDataFrame};
widen::Bool = false,
keep::Union{Nothing, Vector{Symbol}} = nothing)

isempty(dfs) && return DataFrame()
allheaders = map(names, dfs)
uniqueheaders = unique(allheaders)
unionunique = union(uniqueheaders...)
intersectunique = intersect(uniqueheaders...)
coldiff = setdiff(unionunique, intersectunique)

if !isempty(coldiff)
# array of all headers
@show allheaders = map(names, dfs)
# unique arrays of all headers
@show uniqueheaders = unique(allheaders)
# Array of all the unique headers
@show unionunique = union(uniqueheaders...)
# Intersection of all unique headers
@show intersectunique = intersect(uniqueheaders...)
# get the elements that are not present in everything
@show coldiff = setdiff(unionunique, intersectunique)

@show keep
if (widen == false) && !isempty(coldiff) && keep == nothing
# if any DataFrames are a full superset of names, skip them
filter!(u -> Set(u) != Set(unionunique), uniqueheaders)
estrings = Vector{String}(undef, length(uniqueheaders))
for (i, u) in enumerate(uniqueheaders)
matching = findall(h -> u == h, allheaders)
headerdiff = setdiff(coldiff, u)
cols = join(headerdiff, ", ", " and ")
args = join(matching, ", ", " and ")
@show matching = findall(h -> u == h, allheaders)
@show headerdiff = setdiff(coldiff, u)
@show cols = join(headerdiff, ", ", " and ")
@show args = join(matching, ", ", " and ")
estrings[i] = "column(s) $cols are missing from argument(s) $args"
end
throw(ArgumentError(join(estrings, ", ", ", and ")))
end

header = allheaders[1]
# TODO: make sure that `keep` throws a good error if
# it a) isn't in `allheaders` or b) isn't a subset of `unionunique`
header = (keep == nothing) ? unionunique : keep

nalimilan marked this conversation as resolved.
if keep == nothing
# Make the order of the names match the order of the dataframes inputted
header = let unionunique = unionunique, allheaders = allheaders
t = [filter(h -> h in unionunique, head) for head in allheaders]
reduce((a, b) -> [a; setdiff(b, a)], t)
end
else
header = keep
end

length(header) == 0 && return DataFrame()
cols = Vector{AbstractVector}(undef, length(header))
for (i, name) in enumerate(header)
data = [df[name] for df in dfs]
# TODO: replace with commented out code after getindex deprecation
Member

I have not looked at the PR in detail, but it seems you are re-introducing some old code that was already removed here (at least in the comments).

Contributor Author

Deleted.

# data = [df[name] for df in dfs]
# the code below assumes that only DataFrame and SubDataFrame
# are subtypes of AbstractDataFrame
# it should be removed ASAP after deprecation
data = map(dfs) do df
if df isa DataFrame
if haskey(df, name)
return df[name]
else
# TODO: make this more efficient by not creating a
# full array of missing values. Instead, implement
# this in the copyto! stage
return fill(missing, nrow(df))
Member

Sorry for the delay, I had missed the notification. RepeatedVector can probably help here (if using a generator isn't enough).

Contributor Author

Ugh, why did I not think of a generator? It's fixed now.

Contributor Author

It turns out that a generator doesn't infer eltype correctly. Columns are of type Any rather than Union{T, Missing}.

RepeatedVector doesn't work because it's not defined at this point in the code, so we can't use it here. (Should we have a place for all struct definitions?)

Let me know what to do next. I think there is an existing issue about this generator behavior, but I can't find it.

Member

Ah. Maybe with Iterators.repeated? (eltype won't be defined for generators since it needs inference.)
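
For readers following this thread, here is a minimal sketch of the eltype difference being discussed. It uses only Base Julia; the names `present`, `gen`, `rep`, and `fillcolumn` are hypothetical and are not part of the PR. It illustrates why a generator widens the promoted column type to `Any` while `Iterators.repeated` keeps it at `Union{Missing, T}`, and it is not the code that was merged.

```julia
# Hypothetical stand-ins for one column across two data frames:
# the first frame has the column, the second (2 rows) lacks it.
present = [1, 2, 3]
gen = (missing for _ in 1:2)            # generator placeholder
rep = Iterators.repeated(missing, 2)    # lazy repeated placeholder

eltype(gen)   # Any: Base.Generator defines no eltype method
eltype(rep)   # Missing

# Illustrative column-building step (assumed shape, not the PR's code):
# promote the element types, allocate once, then copy element by element.
function fillcolumn(data)
    T = mapreduce(eltype, promote_type, data)
    col = Vector{T}(undef, sum(length, data))
    i = 1
    for d in data, x in d
        col[i] = x
        i += 1
    end
    return col
end

col_rep = fillcolumn(Any[present, rep])   # element type Union{Missing, Int}
col_gen = fillcolumn(Any[present, gen])   # element type Any
```

Neither placeholder allocates a full vector of `missing` values, which is the inefficiency the TODO in the diff points at.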

end
else
if haskey(df, name)
return view(parent(df)[name], rows(df))
Member

This is for SubDataFrame? Then df[name] should be enough and you can drop the branch.

Member

Do you think this will work?

Member

I think we do not need that branch. In any case, data will probably be type unstable (though it is best to check).

Contributor Author

Getting rid of the branch works fine.

else
return fill(missing, length(rows(df)))
end
end
end

lens = map(length, data)
T = mapreduce(eltype, promote_type, data)
cols[i] = Tables.allocatecolumn(T, sum(lens))
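
The header bookkeeping in this hunk can be tried in isolation. The sketch below uses plain vectors of `Symbol` in place of `names(df)`; the sample headers are made up for illustration, and the variable names mirror the diff only for readability. It shows how the output column order follows the order of the inputs, and how the error text for mismatched columns is assembled. It omits the step that filters out headers that are a full superset, so treat it as an approximation of the PR's logic rather than a copy.

```julia
# Made-up headers standing in for names(df1), names(df2), names(df3).
allheaders = [[:a, :b], [:b, :c], [:a, :b]]

# Output order: first frame's columns, then unseen columns as they appear.
header = reduce((a, b) -> [a; setdiff(b, a)], allheaders)

uniqueheaders = unique(allheaders)
unionunique = union(uniqueheaders...)
intersectunique = intersect(uniqueheaders...)
# Columns that are absent from at least one input.
coldiff = setdiff(unionunique, intersectunique)

# One message fragment per distinct header, naming the arguments that share it.
estrings = map(uniqueheaders) do u
    matching = findall(h -> u == h, allheaders)
    headerdiff = setdiff(coldiff, u)
    cols = join(headerdiff, ", ", " and ")
    args = join(matching, ", ", " and ")
    "column(s) $cols are missing from argument(s) $args"
end
msg = join(estrings, ", ", ", and ")
```

With these sample headers, `header` is `[:a, :b, :c]` and `coldiff` is `[:a, :c]`; `msg` is the string that the diff wraps in an `ArgumentError` when neither `widen` nor `keep` is given.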