-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Flexible vcat #1659
Flexible vcat #1659
Conversation
Thanks. Do we really need |
I think
Now you want to append but you only care about the columns In a practical example, in Stata, I often have 2 dataset from different endline surveys, and I only care about questions that are common to both surveys, but there are like 200 questions so I don't want to bother with which ones are and are not in the Endline 1 survey. Even if I could subset the second survey, I wouldn't want to have to figure out which questions are in both and which are not. I'm fine to remove EDIT: Or, rather, I want to keep all the questions from the second endline, but not any that are only in the first endline. I think that's the scenario that is difficult... I also didn't want any default whereby the columns from the first DataFrame in the arguments is kept, because that seemed arbitrary. |
Then maybe what's needed is actually an argument which can be |
Yeah that could be consistent. The issue I found is that everything in |
I guess |
With more than 2 arguments, we could encourage people to use |
Coming over from the reference PR #1672 above. A couple of initial thoughts out loud regarding both workstreams:
Still digging through everything, but that's what comes to mind so far. |
I'm also seeing a lot of hard-coded test code for non-intersecting columns on master when calling There seems to be a lot more to consider than I initially thought, oof. |
Note that
I don't think there's a problem. We just need to allocate new columns using
We've discussed these things a lot before in a Discourse thread. I prefer a single function since it's a standard Julia name and it's easier to discover. There's no reason why making the function more flexible should break code. |
So here's what I've managed to get to a WIP stage, myself, based on all the information you've all provided. What I've changed (from master) is:
Were this to be ideal, it still needs at least a few things in my mind:
I'm of course open to feedback! |
Could you explain how that differs from the current implementation in the current state of this PR? @nalimilan How we should we decide whether to use a strategy that mimics |
It appears that I am quite dumb. I was getting weird merge conflicts when looking at this PR's branch vs. master, so I just tried to recreate it a different way. But I got it sorted out now, and the state of this PR seems nicer. To answer your question: the functionality is the same, but mine was more verbose, and performed sullying allocations on the incoming data. I've started from what is here, now, and pushed up some small suggested changes to the body, docs, and tests (I've opened a PR to your branch, @pdeffebach). I've taken out Regarding the column-selection syntax: to prevent clutter in the body, I think we could really just send in the |
As I said I think we can support both using the same argument. Not sure whether Do you want to finish this PR @pdeffebach? Else, @ryapric would better open a new PR, reviewing PRs against PRs will only help confusing everybody. |
i still dont know how git works tbh
@nalimilan I would like some feedback for how to deal with the imputation of missing values. Looking at the Let me know if there is a common way this sort of thing is done. Thanks! |
@pdeffebach |
# TODO: make this more efficient by not creating a | ||
# full array of missing values. Instead, implement | ||
# this in the copyto! stage | ||
return fill(missing, nrow(df)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I had missed the notification. RepeatedVector
can probably help here (if using a generator isn't enough).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh why did I not think of a generator. It's fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It turns out that a generator doesn't infer eltype correclty. Columns are of type Any
rather than Union{T, Missing}
.
RepeatedVector
doesn't work because it's not defined at this point in the code, so we can't use it here. (Should we have a place for all struct defintions?)
Let me know what to do next. I think there is an issue for this generator issue but I can't find it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Maybe with Iterators.repeated
? (eltype
won't be defined for generators since it needs inference.)
length(header) == 0 && return DataFrame() | ||
cols = Vector{AbstractVector}(undef, length(header)) | ||
for (i, name) in enumerate(header) | ||
data = [df[name] for df in dfs] | ||
# TODO: replace with commented out code after getindex deprecation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not looked at the PR in detail - but it seems you re-introduce some old code that was already removed here (at least in the comments)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted.
if !isempty(coldiff) | ||
# if any DataFrames are a full superset of names, skip them | ||
filter!(u -> Set(u) != Set(header), uniqueheaders) | ||
estrings = Vector{String}(undef, length(uniqueheaders)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map(enumerate(uniqueheaders)) do (i, head)
would be quite simpler here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of my suggestion was that you can do estrings = map...
, and not set estrings[i] =
inside the anonymous function, but just return the string.
test/cat.jl
Outdated
C = [missing, missing, missing, missing, missing, missing, 13, 14, 15]) | ||
end | ||
|
||
@testset "vcat on empty dataframe in loop" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the point of this test. What's so special about loops?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought at one point you wanted a test that mimicked the desired use of this functionality. But I can delete it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well a loop isn't different from calling vcat
on two data frames, so...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted that testset.
test/cat.jl
Outdated
@@ -249,8 +302,6 @@ end | |||
@test err.value.msg == "column(s) B are missing from argument(s) 1" | |||
# multiple missing 1 column | |||
err = @test_throws ArgumentError vcat(df1, df2, df2, df2, df2, df2) | |||
err2 = @test_throws ArgumentError reduce(vcat, [df1, df2, df2, df2, df2, df2]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this still throw an error after the PR?
@@ -1017,20 +1017,39 @@ Base.hcat(df1::AbstractDataFrame, df2::AbstractDataFrame, dfn::AbstractDataFrame | |||
makeunique=makeunique, copycols=copycols) | |||
|
|||
""" | |||
vcat(dfs::AbstractDataFrame...) | |||
vcat(dfs::AbstractDataFrame...; columns::Union{Symbol, AbstractVector{Symbol}}=:equal) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, but since this PR was opened we realized we use cols
rather than columns
in general (e.g. in function names), so better call this cols
I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed to cols
throughout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there are still occurrences of columns
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed in documentation. I don't think there is anywhere else.
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Okay I have responded to your most recent review. |
test/cat.jl
Outdated
@@ -347,25 +350,31 @@ end | |||
|
|||
@testset "vcat out of order" begin | |||
df1 = DataFrame(A = 1:3, B = 4:6, C = 7:9) | |||
<<<<<<< HEAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebase bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed. I thought i checked the tests.
if !isempty(coldiff) | ||
# if any DataFrames are a full superset of names, skip them | ||
filter!(u -> Set(u) != Set(header), uniqueheaders) | ||
estrings = Vector{String}(undef, length(uniqueheaders)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of my suggestion was that you can do estrings = map...
, and not set estrings[i] =
inside the anonymous function, but just return the string.
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Sorry for the repeated rebasing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pdeffebach Thank you.
It would be good to add the following tests (they should pass given my analysis of your code, but still it is good to have them IMO):
- passing a
DataFrame
with 0-rows as one of the arguments (but having some columns), e.g.DataFrame(A=Int[], B=Float64[])
- passing
cols
as a vector containing a column name not present in any of the data frames - passing
cols
as a vector containing a duplicate column name - passing a
SubDataFrame
as one or all of the arguments; a particular case to check could be passing aSubDataFrame
containing a duplicate column name
Please let me know how you see this. After this we can merge this and I will make a release.
@pdeffebach Please let me know when you would be available to add/discuss the additional tests I have proposed (I would like to have new |
@pdeffebach - I have added the requested tests and done some layout improvements and merged the PR. Thank you for the contribution. |
Thank you for putting this over the finish line! So I didn't respond earlier. |
Thank you for the initiative of the improvement. |
This PR adds some functionality for a "flexible" vcat that has more feature parity with dplyr's
bind_rows
and Stata'sappend
.It introduces three keyword arguments:
widen
specifies whether or not the new dataframe contains the union of all the columns in its argument list. Any gaps are filled with missing.keep
specifies a vector of column names that the resulting dataframe will have.fillvalue
indicates what value we will use to fill in all missing data.widen
can still be necessary withkeep
: ifkeep
includes a symbol that is not present in all dataframes in the argument list, it will throw an error withoutwiden
.This is very much a WIP. We need to do the following (also commented in the code)
fill
part very efficient. Currently it allocates new vectors ofmissings
andcopyto!
s them later to fill in the columns.