Skip to content
This repository has been archived by the owner on Jan 2, 2023. It is now read-only.

Issues about this prototype #1

Open
tk3369 opened this issue Sep 19, 2020 · 5 comments
Open

Issues about this prototype #1

tk3369 opened this issue Sep 19, 2020 · 5 comments

Comments

@tk3369
Copy link
Owner

tk3369 commented Sep 19, 2020

This prototype demonstrates how to implement AbstractDataFrame interface. The current implementation allows wrapping a data frame and possibly with additional data in the same struct. The idea came from this comment invenia/KeyedFrames.jl#19 (comment).

However, there are a number of unresolved caveats as documented below:

  1. I must define Base.getproperty (rather than forward) without running into trouble with the parent function
  2. I must define Base.setproperty! (rather than forward) to avoid method ambiguity.
  3. I must define Base.propertynames with private argument to make REPL completion work.
  4. I must define several Base.convert functions
  5. I must define findrows, findrow, manipulate, and SubDataFrame functions.

Issue 4 troubles me the most. For the Base.convert functions, they seems to be invoked during joins functions and eachcol. And it calls for converting between a regular DataFrame and my own type. If I have a type that contains more fields, there is no way to convert from a DataFrame without filling in dummy information as evident in this line of code (required for passing the test):

NDF(df::DataFrame) = NDF(df, "No name")

Issue 5 is also also quite troublesome because I have to extend some internal DataFrames.jl functions. I don't know why they do not take AbstractDataFrame type as argument e.g.
https://github.com/JuliaData/DataFrames.jl/blob/fb4e184f3b3997da8680668fbf7fcc789811eb18/src/dataframerow/utils.jl#L301-L305

@tk3369
Copy link
Owner Author

tk3369 commented Sep 19, 2020

@bkamins If you have a few minutes, may I ask you for some help in reviewing what I have done here? I went ahead and try to implement the AbstractDataFrame interface. I was able to make it work but it feels very hacky... This repo contains the implementation, and the test script shows that it works properly for most of the operations.

I have summarized the issues above. Thank you very much 🙏

@bkamins
Copy link

bkamins commented Sep 20, 2020

This is very nice. What I can recommend to check it (sorry for not being super-constructive with a concrete implementation) is to take the tests from DataFrames.jl and check what gets broken here.

As a general question. When you have AbstractWrappedDataFrame do you want it to be, after some operation, e.g. getindex unwrapped or you want to put a wrapper back around the returned object?

@bkamins
Copy link

bkamins commented Sep 20, 2020

Can you check if DataFrameRow is created correctly using this wrapper?

Also in general - all "core" types (like DataFrameRow, SubDataFrame, GroupedDataFrame, DataFrameRows, DataFrameColumns) should always get DataFrame or SubDataFrame (there are unfortunately scattered assumptions around the code that only either of the two types is stored, and in some cases it even has to be DataFrame only if you want to be on a safe side). Have a look at the constructors to see the cases.

Therefore - the pattern I would recommend is to make sure that you handle the constructors of these types explicitly (maybe the @forward macro can ensure just this - I do not have experience with it).

For example DataFrameRow is always created to contain DataFrame for performance (although its signature is more general):

julia> df = DataFrame(rand(2,3));

julia> dfv = @view df[:, 2:3];

julia> typeof(dfv[1, :])
DataFrameRow{DataFrame,DataFrames.SubIndex{DataFrames.Index,UnitRange{Int64},UnitRange{Int64}}}

@tk3369
Copy link
Owner Author

tk3369 commented Sep 21, 2020

As a general question. When you have AbstractWrappedDataFrame do you want it to be, after some operation, e.g. getindex unwrapped or you want to put a wrapper back around the returned object?

This is a great question. Generally speaking, I don't mind losing the original identity after certain operations. For example, doing joins would usually produce a completely different data frame. Likewise, select or transform would change the structure of the data frame as well.

So it goes back to why I want to do this in the first place. I want my own type such that I can define functions for my types even though the backing store is a DataFrame. When I have many types like this, it would be more convenient that I can take two objects and do some dataframe-like operations without having to unwrap them.

Thanks for the recommendations. Couple of updates:

  1. I realized that wdf[1, :] didn't work and so I fixed it by defining getindex with Colon and MultiColumnIndex with this commit 183be92

  2. I tried to forward the constructor for GroupedDataFrame but that made it worse with more problems... So I commented it out for now.

  3. I pulled the tests from test/broadcasting.jl and incorporated into this repo with a wrapped data frame. The results are in Add broadcasting tests (CI not passed yet) #4.

It seems to be quite broken at this stage... it makes me think whether I should still go down this path 😅

@bkamins
Copy link

bkamins commented Sep 21, 2020

Broadcasting probably should work just by doing:

Base.broadcastable(x::AbstractWrappedDataFrame) = dataframe(x)

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants