Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flexible vcat #1659

Merged
merged 82 commits into from
Apr 26, 2019
Merged

Flexible vcat #1659

merged 82 commits into from
Apr 26, 2019

Conversation

pdeffebach
Copy link
Contributor

@pdeffebach pdeffebach commented Dec 26, 2018

This PR adds some functionality for a "flexible" vcat that has more feature parity with dplyr's bind_rows and Stata's append.

It introduces three keyword arguments:

  1. widen specifies whether or not the new dataframe contains the union of all the columns in its argument list. Any gaps are filled with missing.

  2. keep specifies a vector of column names that the resulting dataframe will have.

  3. fillvalue indicates what value we will use to fill in all missing data.

widen can still be necessary with keep: if keep includes a symbol that is not present in all dataframes in the argument list, it will throw an error without widen.

This is very much a WIP. We need to do the following (also commented in the code)

  • Preserve order of the first dataframe in the argument list, if possible
  • Get all the errors correct
  • Make the fill part very efficient. Currently it allocates new vectors of missings and copyto!s them later to fill in the columns.

@nalimilan
Copy link
Member

Thanks. Do we really need keep though? It's easy enough to get rid of some columns before calling vcat (which isn't possible in Stata). Also, do you have a use case in mind for fillvalue? Maybe we can hardcode missing for now and see whether people actually need more flexibility.

@pdeffebach
Copy link
Contributor Author

pdeffebach commented Dec 27, 2018

I think keep is important. Imagine the following scenario

df1 = DataFrame(rand(10, 10)) # manageable width
df2 = DataFrame(rand(10, 100)) # too wide to want to work with
df3 = DataFrame(x1 = rand(10), x45 = rand(10))

Now you want to append but you only care about the columns :x1, :x2, and :x45. With just widen you would get all 100 columns in df2. And we can't subset df1 because :x45 isn't a column there. Similarly we have to be careful with subsetting df3.

In a practical example, in Stata, I often have 2 dataset from different endline surveys, and I only care about questions that are common to both surveys, but there are like 200 questions so I don't want to bother with which ones are and are not in the Endline 1 survey. Even if I could subset the second survey, I wouldn't want to have to figure out which questions are in both and which are not.

I'm fine to remove fillvalue and make missing the mandatory.

EDIT: Or, rather, I want to keep all the questions from the second endline, but not any that are only in the first endline. I think that's the scenario that is difficult... I also didn't want any default whereby the columns from the first DataFrame in the arguments is kept, because that seemed arbitrary.

@nalimilan
Copy link
Member

Then maybe what's needed is actually an argument which can be :inner, :outer, :left or :right, as in join?

@pdeffebach
Copy link
Contributor Author

Yeah that could be consistent. The issue I found is that everything in vcat is generalized for splatting a vector of DataFrames, so I wasn't sure about making some sort of assumption about the "first" data frame in the argument list.

@nalimilan
Copy link
Member

I guess :left could refer to the first argument, and :right to the last one. Or we could throw an error when there are more than two arguments.

@pdeffebach
Copy link
Contributor Author

:upper, :lower, :inner, :outer is a good strategy. I'm sure that's what dplyr would have done if rbind wasn't so widely used already.

With more than 2 arguments, we could encourage people to use reduce or a for loop when vcating multiple DataFrames. Throwing an error would be quite breaking though, and I don't know how much people are calling vcat on a vector of DataFrames.

@ryapric
Copy link

ryapric commented Jan 11, 2019

Coming over from the reference PR #1672 above.

A couple of initial thoughts out loud regarding both workstreams:

  • Upon further review, I think both of these approaches might be a bit bloated. There already exists a method extension for append! in src/dataframe.jl. Would it not be better to modify this extension to achieve the desired effect, instead of defining new functions/methods, like we've both done here?

    • Edit: sorry, just noticed as well that what's even weirder, is there are method extensions for both append! and vcat already on the master branch, in different src files. I'm not really sure about the logic there, i.e. why both exist.
  • I much prefer the splatting implementation you have here, instead of the array-based parameterization I have.

  • Maybe instead of having both widen and keep, there's just one arg that defaults to the effects of widen = true, but any non-bool passed arg value is a vector of columns to keep? I can understand the join-similar syntax of :inner, etc., but since these aren't joins, I'm not sure of how intuitive those terms would be to users.

  • I'm having a hard time trying to implement a clean fillvalue implementation myself, which requires type conversions on all respective DataFrame columns to the Union of each that is appended. I think there's two possibly good options there:

    • Keep missing as the default fill value, and let users act as they please

    • Let the fillvalue arg be set to missing as default. Then, just call something like:

      if !ismissing(fillvalue) mapcols(x -> replace!(x, missing => fillvalue)) end

    We may still run into coltype issues, though, but I'll know more once I try and implement that.

Still digging through everything, but that's what comes to mind so far.

@ryapric
Copy link

ryapric commented Jan 11, 2019

I'm also seeing a lot of hard-coded test code for non-intersecting columns on master when calling vcat, so if adjusting vcat is still the plan, then those tests would need to be replaced. Now I'm questioning again slightly whether or not to just make this a separate function, especially in the spirit of not breaking someone's existing code... vcatall? appendall!? bindrows? collapse?

There seems to be a lot more to consider than I initially thought, oof.

@nalimilan
Copy link
Member

append! is very different from vcat, as it resizes vectors in-place. So we need both.

* Maybe instead of having both `widen` and `keep`, there's just one arg that defaults to the effects of `widen = true`, but any non-bool passed arg value is a vector of columns to keep? I can understand the `join`-similar syntax of `:inner`, etc., but since these _aren't_ joins, I'm not sure of how intuitive those terms would be to users.

Note that OUTER UNION exists in at least some of the major SQL implementations. Though I guess we could have a single argument which could either be a Boolean, a vector of names, or :inner/:outer/:left/:right. We just need to find a good name for it.

I'm having a hard time trying to implement a clean fillvalue implementation myself, which requires type conversions on all respective DataFrame columns to the Union of each that is appended. I think there's two possibly good options there:

* Keep `missing` as the default fill value, and let users act as they please

* Let the `fillvalue` arg be set to `missing` as default. Then, just call something like:
  `if !ismissing(fillvalue) mapcols(x -> replace!(x, missing => fillvalue)) end`

We may still run into coltype issues, though, but I'll know more once I try and implement that.

I don't think there's a problem. We just need to allocate new columns using Tables.allocatecolumn(promote_type(eltype(col), eltype(fillvalue)), len). Better not special-case missing. And actually the existing code already does that, and the current state of this PR takes advantage of it.

I'm also seeing a lot of hard-coded test code for non-intersecting columns on master when calling vcat, so if adjusting vcat is still the plan, then those tests would need to be replaced. Now I'm questioning again slightly whether or not to just make this a separate function, especially in the spirit of not breaking someone's existing code... vcatall? appendall!? bindrows? collapse?

We've discussed these things a lot before in a Discourse thread. I prefer a single function since it's a standard Julia name and it's easier to discover. There's no reason why making the function more flexible should break code.

@ryapric
Copy link

ryapric commented Jan 14, 2019

So here's what I've managed to get to a WIP stage, myself, based on all the information you've all provided. What I've changed (from master) is:

  • Add a keepcols keyword to vcat, which defaults to nothing. This is entirely ignored if not supplied explicitly by the user, maintaining perfect backwards-compatibility with how vcat acts today. It is checked if non-nothing in a few places, and only serves to conditionally prepare the incoming data. Then the rest of the existing _vcat code is executed as normal.

  • keepcols accepts an array of Symbols, which are the column names to keep in the output DataFrame. The default (if invoking the new usage) is to pass [:allcols], which returns a DataFrame containing all of the columns across all passed DataFrames. Values are filled with missing if they don't exist.

  • Add some basic tests in advance so that I would know when the desired functionality was achieved. More to come.

  • Edit the ArgumentErrors thrown due to set-difference columns and not passing keepcols. The errors now will suggest the user to pass keepcols if this is the desired behavior. I have also edited the vcat errors section of test/cat.jl, so that the error outputs match these suggestions.

Were this to be ideal, it still needs at least a few things in my mind:

  1. A way to reset the allocations & type conversions done to the passed DataFrames. I'm open to hearing everyone's thoughts on that, though, since Julia is pass-by-reference anyway.

  2. Check to make sure the column order is preserved according to the first DataFrame. I haven't written tests for that yet, so maybe it's already behaving this way. Or, maybe that's not as important anymore?

  3. More tests (I wrote just enough in advance to see if the basic applications worked, but there's more I'd like to write).

  4. Reconsider the best name for the default arg to invoking keepcols. It could just be a single Symbol, :all, but then the type signature would be Union{Symbol, Array{Symbol}, Nothing}}. Not a problem, but verbose & unsightly, and only serving to allow development like this. If the join-like syntax is desired, then I don't think that would be too many tweaks from what exists now, so that's also an option.

  5. Update documentation once the functionality is finalized.

I'm of course open to feedback!

@pdeffebach
Copy link
Contributor Author

Could you explain how that differs from the current implementation in the current state of this PR?

@nalimilan How we should we decide whether to use a strategy that mimics join and whether to use a keepcols based strategy? I'm partial to the join approach, but I want to know how to make sure there is some consensus.

@ryapric
Copy link

ryapric commented Jan 14, 2019

It appears that I am quite dumb. I was getting weird merge conflicts when looking at this PR's branch vs. master, so I just tried to recreate it a different way. But I got it sorted out now, and the state of this PR seems nicer. To answer your question: the functionality is the same, but mine was more verbose, and performed sullying allocations on the incoming data.

I've started from what is here, now, and pushed up some small suggested changes to the body, docs, and tests (I've opened a PR to your branch, @pdeffebach). I've taken out widen, and let keep handle all the conditional resolution. I've left fillvalue as-is, but in some ad-hoc testing, the chance of anything other than missing as a value runs the risk of converting a whole column to Any. So, I've warned users in the docstring of such, and added a suggested use case for Missings.replace().

Regarding the column-selection syntax: to prevent clutter in the body, I think we could really just send in the keep value and some of the initial header objects to another internal function in the call to define header, to handle greater flexibility. For example, maybe :all and :outer are synonymous in terms of what columns to select. So, both could resolve to unionunique. Maybe the same idea for :inner, :same, :intersect, etc, in that they all resolve to the same selection. This function could (and should) still be able to take/return a vector of Symbols for finer selection control.

@nalimilan
Copy link
Member

@nalimilan How we should we decide whether to use a strategy that mimics join and whether to use a keepcols based strategy? I'm partial to the join approach, but I want to know how to make sure there is some consensus.

As I said I think we can support both using the same argument. Not sure whether keepcols is the best name, need to think about it.

Do you want to finish this PR @pdeffebach? Else, @ryapric would better open a new PR, reviewing PRs against PRs will only help confusing everybody.

@bkamins bkamins mentioned this pull request Jan 15, 2019
31 tasks
@pdeffebach
Copy link
Contributor Author

@nalimilan I would like some feedback for how to deal with the imputation of missing values. Looking at the map block on line 1049, you can see that I just return a vector of length nrow(df) of missings. I'm sure there is a better way to do this. I thought about having some sort of flag that we return, like nothing and then filling missings with copyto! based on that. But that seems non-type stable, correct?

Let me know if there is a common way this sort of thing is done. Thanks!

@bkamins
Copy link
Member

bkamins commented Jan 22, 2019

@pdeffebach
I wanted to write to you here about #1691 again, but it seems no test files were changed in this PR 😸.

# TODO: make this more efficient by not creating a
# full array of missing values. Instead, implement
# this in the copyto! stage
return fill(missing, nrow(df))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I had missed the notification. RepeatedVector can probably help here (if using a generator isn't enough).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh why did I not think of a generator. It's fixed now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It turns out that a generator doesn't infer eltype correclty. Columns are of type Any rather than Union{T, Missing}.

RepeatedVector doesn't work because it's not defined at this point in the code, so we can't use it here. (Should we have a place for all struct defintions?)

Let me know what to do next. I think there is an issue for this generator issue but I can't find it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Maybe with Iterators.repeated? (eltype won't be defined for generators since it needs inference.)

length(header) == 0 && return DataFrame()
cols = Vector{AbstractVector}(undef, length(header))
for (i, name) in enumerate(header)
data = [df[name] for df in dfs]
# TODO: replace with commented out code after getindex deprecation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not looked at the PR in detail - but it seems you re-introduce some old code that was already removed here (at least in the comments)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted.

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
if !isempty(coldiff)
# if any DataFrames are a full superset of names, skip them
filter!(u -> Set(u) != Set(header), uniqueheaders)
estrings = Vector{String}(undef, length(uniqueheaders))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

map(enumerate(uniqueheaders)) do (i, head) would be quite simpler here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of my suggestion was that you can do estrings = map..., and not set estrings[i] = inside the anonymous function, but just return the string.

test/cat.jl Outdated Show resolved Hide resolved
test/cat.jl Outdated Show resolved Hide resolved
test/cat.jl Outdated
C = [missing, missing, missing, missing, missing, missing, 13, 14, 15])
end

@testset "vcat on empty dataframe in loop" begin
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the point of this test. What's so special about loops?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought at one point you wanted a test that mimicked the desired use of this functionality. But I can delete it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well a loop isn't different from calling vcat on two data frames, so...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted that testset.

test/cat.jl Outdated
@@ -249,8 +302,6 @@ end
@test err.value.msg == "column(s) B are missing from argument(s) 1"
# multiple missing 1 column
err = @test_throws ArgumentError vcat(df1, df2, df2, df2, df2, df2)
err2 = @test_throws ArgumentError reduce(vcat, [df1, df2, df2, df2, df2, df2])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this still throw an error after the PR?

@@ -1017,20 +1017,39 @@ Base.hcat(df1::AbstractDataFrame, df2::AbstractDataFrame, dfn::AbstractDataFrame
makeunique=makeunique, copycols=copycols)

"""
vcat(dfs::AbstractDataFrame...)
vcat(dfs::AbstractDataFrame...; columns::Union{Symbol, AbstractVector{Symbol}}=:equal)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but since this PR was opened we realized we use cols rather than columns in general (e.g. in function names), so better call this cols I guess.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to cols throughout

Copy link
Member

@nalimilan nalimilan Apr 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there are still occurrences of columns.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed in documentation. I don't think there is anywhere else.

nalimilan and others added 13 commits April 23, 2019 15:15
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
Co-Authored-By: pdeffebach <23196228+pdeffebach@users.noreply.github.com>
@pdeffebach
Copy link
Contributor Author

Okay I have responded to your most recent review.

test/cat.jl Outdated
@@ -347,25 +350,31 @@ end

@testset "vcat out of order" begin
df1 = DataFrame(A = 1:3, B = 4:6, C = 7:9)
<<<<<<< HEAD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebase bug.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed. I thought i checked the tests.

if !isempty(coldiff)
# if any DataFrames are a full superset of names, skip them
filter!(u -> Set(u) != Set(header), uniqueheaders)
estrings = Vector{String}(undef, length(uniqueheaders))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The point of my suggestion was that you can do estrings = map..., and not set estrings[i] = inside the anonymous function, but just return the string.

src/abstractdataframe/abstractdataframe.jl Outdated Show resolved Hide resolved
test/cat.jl Outdated Show resolved Hide resolved
test/cat.jl Outdated Show resolved Hide resolved
nalimilan and others added 5 commits April 24, 2019 15:33
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Sorry for the repeated rebasing.

@nalimilan nalimilan changed the title WIP: Flexible vcat Flexible vcat Apr 25, 2019
Copy link
Member

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdeffebach Thank you.

It would be good to add the following tests (they should pass given my analysis of your code, but still it is good to have them IMO):

  • passing a DataFrame with 0-rows as one of the arguments (but having some columns), e.g. DataFrame(A=Int[], B=Float64[])
  • passing cols as a vector containing a column name not present in any of the data frames
  • passing cols as a vector containing a duplicate column name
  • passing a SubDataFrame as one or all of the arguments; a particular case to check could be passing a SubDataFrame containing a duplicate column name

Please let me know how you see this. After this we can merge this and I will make a release.

@bkamins
Copy link
Member

bkamins commented Apr 25, 2019

@pdeffebach Please let me know when you would be available to add/discuss the additional tests I have proposed (I would like to have new vcat in the current release and I am planning to do it soon).

@bkamins bkamins merged commit 44d8cf1 into JuliaData:master Apr 26, 2019
@bkamins
Copy link
Member

bkamins commented Apr 26, 2019

@pdeffebach - I have added the requested tests and done some layout improvements and merged the PR. Thank you for the contribution.

@pdeffebach
Copy link
Contributor Author

Thank you for putting this over the finish line! So I didn't respond earlier.

@bkamins
Copy link
Member

bkamins commented Apr 26, 2019

Thank you for the initiative of the improvement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants