Metadata on data frame and column level #3055
Another option to consider is to use metadata only if there are no conflicts between input data frames (i.e. it's present in one but absent from others, or equal in all data frames that have it). The advantage is that it would be order-independent. FWIW, R's |
For joining in Stata, the left data frame takes precedence. I think this is the correct default, and we should do it in DataFrames.jl as well. See this gist describing Stata's behavior. For |
You mean that if the left and right tables have the same "table level" (not column level) metadata key, then the value is kept from the left table? (Please keep in mind that we will have two kinds of metadata: table level and column level; now we are discussing table level metadata.)
Ah. Sorry for the confusion. Just did some research. It looks like Stata does not have named dataset-level metadata, for example "Date" or "Source". It's just a vector of strings. So Stata doesn't deal with this explicitly. All the notes just get added together. But I still think having the left one be dominant is the right way to go.
Don't you think it would be confusing or even dangerous if doing EDIT: joins are different as in |
Good point. But still, |
I would also agree that having the left data be dominant makes sense. It's the table for which you're keeping all keys (+ rows) and the joining table is "additional", so it feels like that would make sense to me.
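To make the two table-level options discussed so far concrete, here is a minimal sketch on plain Julia dictionaries. It is not the DataFrames.jl implementation: `merge_left_dominant` and `merge_conflict_free` are hypothetical helper names, and the example keys and values are made up for illustration.

```julia
# Hypothetical helpers operating on plain Dicts; illustration only.

# Left-dominant: on a key clash the left table's value wins.
# `merge` gives precedence to later arguments, so `left` goes last.
merge_left_dominant(left::AbstractDict, right::AbstractDict) = merge(right, left)

# Conflict-free: keep a key only if all tables that have it agree on its value.
# The result does not depend on the order of the inputs.
function merge_conflict_free(tables::AbstractDict...)
    result = Dict{String, Any}()
    conflicted = Set{String}()          # keys already found to disagree
    for meta in tables, (key, value) in meta
        key in conflicted && continue
        if !haskey(result, key)
            result[key] = value
        elseif !isequal(result[key], value)
            delete!(result, key)        # disagreement: drop the key entirely
            push!(conflicted, key)
        end
    end
    return result
end

left  = Dict("Source" => "survey A", "Date" => "2022-05-01")
right = Dict("Source" => "survey B", "Author" => "someone")

merge_left_dominant(left, right)   # "Source" comes from `left`
merge_conflict_free(left, right)   # "Source" is dropped, the rest is kept
```

The left-dominant variant never loses a key but silently picks a side; the conflict-free variant gives the same result for any input order but discards keys on which the inputs disagree.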
@quinnj You're thinking about |
@nalimilan - can you please have a look at the implementation? If it is OK for you I will go ahead and add:
I have implemented both table and column level metadata.
Yes, looks good! The dict of dicts approach to store per-column metadata can always be improved later if needed.
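For readers following along, a dict-of-dicts layout for per-column metadata could look roughly like the sketch below. This is only an assumed shape, not the PR's actual internals: `MetaStore`, `set_colmeta!` and `get_colmeta` are hypothetical names, and the key and value types are guesses.

```julia
# Hypothetical container; the names and types are assumptions for illustration.
struct MetaStore
    table::Dict{String, Any}                    # table-level metadata
    columns::Dict{Symbol, Dict{String, Any}}    # column name => that column's metadata
end

MetaStore() = MetaStore(Dict{String, Any}(), Dict{Symbol, Dict{String, Any}}())

# Create the inner dict lazily the first time a column receives metadata.
function set_colmeta!(store::MetaStore, col::Symbol, key::AbstractString, value)
    inner = get!(() -> Dict{String, Any}(), store.columns, col)
    inner[key] = value
    return store
end

# Return `default` when either the column or the key has no entry.
function get_colmeta(store::MetaStore, col::Symbol, key::AbstractString, default=nothing)
    haskey(store.columns, col) || return default
    return get(store.columns[col], key, default)
end

store = MetaStore()
set_colmeta!(store, :income, "label", "Household income")
get_colmeta(store, :income, "label")
get_colmeta(store, :age, "label", "no label")
```

With this layout, any operation that renames or drops a column also has to update the outer dictionary, which is why functions that mutate the list of columns need special care.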
We should decide on it now. The reason is that breaking internals of |
Yes. More precisely a |
Ah - now I see we do not need to wait for #3047, as I intentionally kept there only functions that do not mutate the list of columns. Problematic will be e.g. |
Looking at the PR right now, is it true that if the column
will destroy that metadata? |
Yes. The idea is that |
only if |
Okay. I guess the equivalent in Stata is |
Could you please elaborate what you mean there? Thank you!
@nalimilan - I am done with the updates after your review. metadata.jl is significantly refactored.
I'm lost in all these tests. I guess that means they cover almost everything. :-D
Co-authored-by: Milan Bouchet-Valat <nalimilan@club.fr>
@nalimilan - I have applied all suggestions. Things to discuss that I left unresolved:
OK, let's see how it goes! :-)
Thank you! We are almost at 1.4 release.
This PR waits for JuliaData/DataAPI.jl#48.
I have done an initial implementation. Now we need to discuss for which methods metadata propagation should happen. For now I have implemented it for getindex. I stopped at hcat: if we hcat several data frames, how do you think we should handle metadata? Options are:
Which one do we pick? (Once we have this decision, it will naturally propagate to other cases.)