-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Continue adding Metadata to dataframes #1458
Conversation
Describe to dataframe
Rearranged get_stats, fixed docstring, re-did data constructor, and other small fixes.
Resurrecting this a little bit. What is the API of metadata?
This works very nicely with broadcasting.
There are only a few helper functions I can think of:
With the third of these bullets, you could use broadcasting again to do
That is 4 functions added to the namespace of DataFrames. |
I'm gonna put together a larger document about how metadata is handled in various systems to try and work out an API. here is an issue on pandas which is eerily similar to discussions on discourse about this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the long delay. Below are a few reactions to your points.
I like the idea of having
metadata(df, :x1, :field) = "my information"
assign metadata and justmetadata(df, :x1, :field)
retrieve metadata. This means overridinggetindex
andsetindex
but only allowing them to be accessed viametadata()
?
No, that's not possible. The closest syntax which could work with setindex!
is metadata(df)[:x1, :field] = "my information"
or metadata(df, :field)[:x1] = "my information"
, assuming metadata
would return an object with a custom type. Else, we have to do metadata!(df, :x1, :field, "my information")
.
I guess the main advantage of having metadata(df)
return a special object is that we could use keys(metdata(df))
to get the list of keys. Then metadata(df)[:somekey]
would return a vector or a named tuple, metadata(df)[:, :somecol]
would return a named tuple of keys.
Here's a comparison of the syntaxes for common operations:
setindex! /getindex |
function | |
---|---|---|
Get key for all columns | metadata(df)[:somekey] |
metadata(df, :somekey) |
Get key for one column | metadata(df)[:somekey, :somecol] |
metadata(df, :somekey, :somecol) |
Get all keys for one column | metadata(df)[:, :somecol] |
metadata(df, :, :somecol) |
Set key for one column | metadata(df)[:somekey, :somecol] = "x" |
metadata!(df, :somekey, :somecol, "x") |
Get list of keys | keys(metadata(df)) |
metadatakeys(df) |
More generally, returning a special object would allow supporting more operations in the future without exporting new functions. For example, rename!(metadata(df), :key1 => :key2)
could be used to rename a key.
Mis-typing
I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.
Well, that's what happens if you mistype df[:mycol] = 1
already. To prevent that, we would have to require people to call addfield!
manually, which would be quite strict. Also that would make the API more complex.
joining dataframes calls the new constructor, so to make joins work we might have to touch some of the constructor code to allow metadata.
We will also have to profile this extensively to make sure we aren't losing performance for people who don't wish to use metadata.
Honestly I'm not too concerned about that, as the amount of work to do is really small compared with the join itself. However, in general it would be a good idea to check whether the meta-data is empty, and take a fast path in that case to eliminate the overhead.
notonright = setdiff(keys(leftmeta.dict), keys(rightmeta.dict)) | ||
|
||
for field in notonleft | ||
newfield!(leftmeta, length(leftindex), field, nothing) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you noted in the issue description, this approach is not very efficient, and it doesn't work for rightmeta
since it shouldn't be modified. Another way of doing this is in for key in keys(leftmeta.dict)
, to check whether the key exists in the left and right data frames. If it exists in both, call vcat
as you currently do. If it exists only in one of the data frames, allocate a Vector{Union{Nothing, eltype(key_vec)}}
and call copyto!
to fill the corresponding entries.
I have not been in this discussion earlier, and as I have mentioned to @pdeffebach ealier it would be great to have a write-up specifying what we want to achieve here. Looking at the code I understand that we want to be able to attach arbitrary metadata to columns of a I can see that the design is to have a dictionary mapping keys to vectors of values for all columns of a |
Yes, that's basically the goal. See #1413 and https://discourse.julialang.org/t/how-to-add-metadata-info-to-a-dataframe/11168. |
Thanks! Seeing the volume of the discussion I would recommend to document how it should work (even something like minimal dev docs would be good as we are getting to the point that the design of the whole package might become so complex that we are not sure when we change it one place what other parts might break). |
@bkamins thanks for the feedback. I have a plan to write a Stata script and some R package and do a battery of transforms and merges and see how they behave, but I haven't had time for it recently. However as far as I can tell there aren't that many systems out there that do this like Stata. So there will likely still be some behaviors unsettled. I will try and get a non-joining API working based on Milan's comments. Then as you suggested a markdown document will be very helpful. The whole set of changes is simple for now, or at least isolated to a few files (a lot of the conversation is based on my lack of git understanding), but will get far more complicated once joins and constructors get involved. Hopefully I can work on this next week. |
Great. So just a few issues that came to my mind and should be decided upon (maybe they are):
(of course maybe most of those in the first round can be ignored and an appropriate mechanisms added later - the key is to be sure that we do not loose sync between metadata and data) Also (referring to the discussion above) - if someone wants to assign metadata to the column that does not exist I think an error should be thrown. |
I don't think these issues are hard, really. It's fine to drop meta-data in some operations like joins or grouping for now. We just have to handle column insertion and deletion, or the feature would really be confusing (as we would have to drop all meta-data). |
Agreed they do not have to be hard 😄, I just learned the hard way that sometimes simple things are not so simple as they seem. But I agree that dropping metadata, at least for now, is OK. I would say that most of the time when you have metadata attached to a |
I would like to continue working on this now that the vcat has been merged and #1620 is about to be merged. Are we still on board with this general approach? Do either of you trust me to rebase this myself? |
Rebasing src/DataFrames.jl and src/abstractdataframe/join.jl should be easy. From my side I would greatly appreciate if you added a section in docs/src/lib/types.md on medatata laying out the design assumptions - this would help me review the implementation. Also please remember to update docs/src/lib/functions.md. Also adding some info to docs/src/man section would be welcome (but is not crucial for the review). Thank you! |
Bump @pdeffebach |
@pdeffebach If you plan to be at JuliaCon2019 this functionality is perfect to work on and discuss during the conference. We could move things forward fast there. |
I will unfortunately not be at JuliaCon. I have been at home for the summer before grad school so I haven't been thinking much about Julia. I will try rebasing today at the very least so this PR isn't dead. After that I will work on documentation to write a contract for all this behavior. |
Thank you. A "contract" is exactly what we need. As I have said earlier - having thing like https://juliadata.github.io/DataFrames.jl/stable/lib/indexing.html is a must, in my opinion, for a complex change like this because:
Adding metadata will be a fantastic enchantment and I know that many people would love to have it. |
@pdeffebach - probably we should close this and keep the discussion on what we do in #2961. OK? |
Sounds good! |
This PR picks up where #1429 left off, before I got messed up and deleted the branch.
Here are some outstanding issues with the current code:
Allocation for merge!
If
df
_left' has a:label
field anddf_right
does not, thenmerge!
creates a metadata for the unique columns ondf_right
and adds a:label
field populated with nothings. This is undesirable becauseDict
, and a new metadata in generalnothing
s to an array.My intuition is that (1) is bad but (2) is consistent with what
merge!
does for the columns of dataframes. So I should change the behavior to just appendnothing
s to all the fields in themetadata
fordf_left
. This is particularly important whendf_right
is empty. But the concept still applies for ifdf_left
anddf_right
have different fields.Getindex and setindex
I like the idea of having
metadata(df, :x1, :field) = "my information
assignmetadata
and justmetadata(df, :x1, :field
retrieve metadata. This means overridinggetindex
andsetindex
but only allowing them to be accessed viametadata()
?Mis-typing
I think that automatically making a new vector if you add metadata to a field that doesn't exist yet is dangerous, because then typos can allocate new fields silently. But this is small and can be addressed later.
Future work
join
ing dataframes calls the new constructor, so to makejoins
work we might have to touch some of the constructor code to allowmetadata
.We will also have to profile this extensively to make sure we aren't losing performance for people who don't wish to use metadata.