Add metadata support to DataFrames #1413
This is awesome. 🎉 I don't love the names
Well, I hope the method names will be the only thing to change ... ;-)
Thanks for taking the initiative! Here are a few general remarks:
Done, new method names are:
I agree we have no strong use cases to use For instance, we may suggest users to adopt
I'm not sure I understood. Are you proposing:
? This approach is not flexible enough since you would not have column specific keys (i.e.
but also this approach is not convenient since Or maybe I'm missing something?
Sorry, I don't understand why. Data and metadata live in two separate objects, and Moreover, when a column is added/deleted the corresponding Dict is added/deleted accordingly. Could you please elaborate on this?
Update of the first comment: Metadata are internally stored as a Metadata access is performed through the following methods:
Example:
This PR adds no package dependencies and is backward compatible. All new methods have their own docstring, and a new module has been added to test the new facility (in `test/meta.jl`).
Ah, indeed now that you mention I wonder whether it wouldn't be better to provide a single
It would be confusing to allow for both strings and symbols. Anyway they are displayed the same, so I'm not sure why you say strings are better in that case?
I mean the former (but with Both approaches are equivalent from the user's POV, it's just a matter of efficiency in typical use cases. Updating the meta-data when the index is modified isn't an issue, it just requires a few additional function calls when adding or removing columns.
I mean that things like
Piggybacking on the point about indexing - how are the
Good point.
R supports metadata via
@pdeffebach I would take issue with your statement that R attributes are only useful for data frames. Any complexity of R (and before R, S) objects is determined by attributes. Dimensions of matrices and higher-order arrays, for example, are stored as attributes. The only primitive R data objects, in the sense of SEXPREC's, are fixed-length vectors of 32-bit integers, fixed-length vectors of 64-bit floats, fixed-length vectors of interned character strings, and fixed-length vectors of pointers to SEXPRECs. Everything else is coded in the attributes. I don't think that R attributes are a good model for this facility.
I think that @nalimilan's suggestion of having an extractor that returns a
@gcalderone Maybe wait until the resolution of the discussion in the Discourse thread to avoid wasting your time if we choose the alternative approach (storing meta-data in vectors). Reference to previous discussion: #35.
…slicing a DataFrame
Implemented metadata copying while copying/slicing/creating a view of the DataFrame. I also added a Examples:
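The examples from this comment are not preserved in this copy. As a rough illustration only, here is a minimal sketch of what "metadata copying while slicing" could look like. It assumes a hypothetical `Annotated` container and a hypothetical `selectcols` helper; neither is part of the PR or of DataFrames.

```julia
# Hypothetical container: column data plus one metadata dictionary per column.
# This is an illustration, not the PR's implementation.
struct Annotated
    columns::Dict{Symbol,Vector{Float64}}
    colmeta::Dict{Symbol,Dict{Any,Any}}
end

# Keep only the requested columns. The metadata dictionaries of the kept
# columns are copied (shallowly), so editing the result leaves the source intact.
function selectcols(a::Annotated, names::Vector{Symbol})
    cols = Dict{Symbol,Vector{Float64}}(n => copy(a.columns[n]) for n in names)
    meta = Dict{Symbol,Dict{Any,Any}}(n => copy(a.colmeta[n]) for n in names)
    return Annotated(cols, meta)
end

a = Annotated(
    Dict(:x => [1.0, 2.0], :y => [3.0, 4.0]),
    Dict(:x => Dict{Any,Any}("unit" => "m"), :y => Dict{Any,Any}("unit" => "s")),
)

b = selectcols(a, [:x])
b.colmeta[:x]["unit"] = "km"   # modifies only the copy held by `b`
```

Because `copy` on a `Dict` is shallow, mutable values stored inside the metadata would still be shared; `deepcopy` would be the alternative if full isolation were wanted.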
Sorry for the delay. After thinking a bit more about this, I think it would be cleaner to define a Then, as noted above, you don't need to add all these new functions: people can get and set meta-data using The definition of the type and of its methods should go to a separate src/dataframemeta/dataframemeta.jl file. Please also drop
Okay as far as I can tell this means
Then we can do this with just
Without the introduction of
I also see what you mean with regards to The way that this PR would get around this is by having @gcalderone @nalimilan should we finalize the changes to
Yes, better keep them separate. I'm not sure why renaming should affect meta-data: column-specific meta-data should be stored by position rather than by name, so that it only needs to be adjusted when reordering columns (which is less frequent).
Yeah, I guess it makes sense to start with a minimal implementation. However it should probably support at least a few operations so that it's testable and at least minimally usable.
I was trying to implement a lot of the functions for Then I realized that's the whole point of the I will worry about global metadata later since that seems like it will be easier to add after the indexing and renaming functions are taken care of. edit: It should probably be a
Sorry, I don't understand. Wouldn't storing column-specific meta-data in a vector allow precisely to be agnostic to column names and only use their integer indices instead? Then you can use the index to get the integer index from the name, which the code already does anyway.
Yes, that is exactly what I am trying to implement. A first step will be to have In the future, we might not want to initiate an array of empty However this just means clever versions of
OK. But then why use dicts when vectors would be simpler and more efficient?
The purpose of using I am imagining a "default" I also changed my mind about a vector of
We clearly need a dictionary to map the user-defined meta-data types to their values. But better store values for each type of meta-data in a vector with one entry per column. Yes, you need to resize the vector each time you add a column, but calling
Are you saying a vector
If that's the case, I'm not sure that's a great idea because Here is what I have written. I just finished the
So if we have a dataframe
Our
and we want Then there is a constructor that makes a new If I understand correctly, you are saying that it is this copying of
That's not a big deal. Typically that will use 64 bits per column, which is nothing compared to the size of the columns themselves. And dicts consist of three arrays, using 336 bits by default even when empty for
Exactly. I assume it's unlikely you will use lots of meta-data fields that will differ from one column to another. As I noted, copying a dict involves copying three vectors, plus five
Thank you for the guidance! So my impression is that your vision of
Then Unless you wanted a more ` |
Yes, more or less something like that. It's fine to have a single dict to map meta-data fields to the vectors that hold the values.
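To make the layout discussed in the last few comments concrete, here is a minimal sketch of "a single dict mapping meta-data fields to vectors with one entry per column". The type `ColumnMeta` and the functions `colindex`, `setmeta!`, `getmeta`, `addcolumn!`, and `renamecolumn!` are hypothetical names chosen for illustration; they are not the code from this PR.

```julia
# Hypothetical sketch: metadata field => vector with one slot per column.
# Values are stored by column position, so names only matter for lookup.
struct ColumnMeta
    names::Vector{Symbol}              # column names, in column order
    fields::Dict{Symbol,Vector{Any}}   # metadata field => per-column values
end

ColumnMeta(names::Vector{Symbol}) = ColumnMeta(names, Dict{Symbol,Vector{Any}}())

# Translate a column name into its integer position.
function colindex(m::ColumnMeta, name::Symbol)
    i = findfirst(isequal(name), m.names)
    i === nothing && throw(KeyError(name))
    return i
end

# Set one field for one column, creating the field's vector on first use.
function setmeta!(m::ColumnMeta, name::Symbol, field::Symbol, value)
    vals = get!(m.fields, field) do
        Any[nothing for _ in 1:length(m.names)]
    end
    vals[colindex(m, name)] = value
    return m
end

# Read one field for one column; unset slots fall back to `default`.
function getmeta(m::ColumnMeta, name::Symbol, field::Symbol; default=nothing)
    haskey(m.fields, field) || return default
    v = m.fields[field][colindex(m, name)]
    return v === nothing ? default : v
end

# Adding a column grows every field vector by one empty slot.
function addcolumn!(m::ColumnMeta, name::Symbol)
    push!(m.names, name)
    for vals in values(m.fields)
        push!(vals, nothing)
    end
    return m
end

# Renaming touches only the name list; the positional metadata is untouched.
function renamecolumn!(m::ColumnMeta, old::Symbol, new::Symbol)
    m.names[colindex(m, old)] = new
    return m
end

m = ColumnMeta([:a, :b])
setmeta!(m, :a, :label, "Age")
addcolumn!(m, :c)
renamecolumn!(m, :a, :age)
```

In this sketch a rename is a single assignment and adding a column costs one `push!` per metadata field, which matches the point that positional storage only needs adjusting when columns are added, removed, or reordered.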
This may be outside the desired scope of this effort, but have you considered extending support to include both column and row metadata? An example of where this would be useful would be something like genomic data where each row might be a gene, and each column a sample or patient. Having associated metadata for both would be useful.
Per-row metadata is just... a column? Am I missing something? Do you know of other software which supports this?
@nalimilan I guess I am thinking of cases where you have something more like homogeneous numeric data in most columns and you wouldn't necessarily want to mix in "metadata" columns that represent something else. In this case though I suppose an annotated multidimensional array is really what I am looking for. As far as something already implementing this, I've started work on something like this in R, but it is still very immature and far-from-perfect, which is why I'm exploring what's been done in other communities ;)
I guess you could use per-column meta-data to indicate which columns are "real data" and which are "metadata". :-) We haven't implemented anything to select columns based on criteria (regexes, name ranges, types...) yet, but we should certainly investigate this area (like dplyr and JuliaDB).
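As a small illustration of that idea, the sketch below tags each column with a `"role"` entry and then picks out the columns marked as data. The `"role"` key, the column names, and the `columns_with` helper are all assumptions made up for this example; nothing like this exists in the PR.

```julia
# Hypothetical per-column metadata: column name => dictionary of entries.
colmeta = Dict(
    :gene_id => Dict("role" => "metadata"),
    :sample1 => Dict("role" => "data"),
    :sample2 => Dict("role" => "data"),
)

# Names of the columns whose metadata entry `key` equals `value`.
# The result is sorted because Dict iteration order is unspecified.
function columns_with(colmeta, key, value)
    return sort([name for (name, meta) in colmeta if get(meta, key, nothing) == value])
end

datacols = columns_with(colmeta, "role", "data")
```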
I am not entirely clear on the relation of this PR and #1458. So my question is: which of them should be left open, or should both be closed and a new PR opened that is rebased to v0.19 and implements the target functionality (from my experience it is sometimes simpler than trying to update old PRs)?
This PR adds metadata support to DataFrames. The idea for this PR comes from this discussion.

Metadata are internally stored as a `Dict{Any,Any}`, one for each column and one for the whole table. Metadata access is performed through the following methods:

- `metaget(df::DataFrame, key; default=nothing)`: returns the metadata entry with key `key` from the table dictionary. If the key is not present the value of the `default` keyword will be returned;
- `metaget(df::DataFrame, column::Symbol, key; default=nothing)`: returns the metadata entry with key `key` from the column `column` dictionary. If the key is not present the value of the `default` keyword will be returned;
- `metaset!(df::DataFrame, key, value)`: set an entry in the table metadata dictionary with key `key` and value `value`;
- `metaset!(df::DataFrame, column::Symbol, key, value)`: set an entry in the column `column` metadata dictionary with key `key` and value `value`;
- `metadict(df::DataFrame)`: return the table dictionary;
- `metadict(df::DataFrame, column::Symbol)`: return the column `column` dictionary.

Example:

This PR adds no package dependencies and is backward compatible. All new methods have their own docstring, and a new module has been added to test the new facility (in `test/meta.jl`).
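The example code from the description is not preserved in this copy. As a rough stand-in, the sketch below mimics the described accessors on a hypothetical `MetaTable` type so that it runs without DataFrames. The method names and signatures follow the list above, but the type, its fields, and the sample data are assumptions, not the PR's implementation.

```julia
# Hypothetical stand-in for a DataFrame carrying metadata; NOT the PR's code.
struct MetaTable
    columns::Dict{Symbol,AbstractVector}   # column name => data
    tablemeta::Dict{Any,Any}               # one dictionary for the whole table
    colmeta::Dict{Symbol,Dict{Any,Any}}    # one dictionary per column
end

# Build a table from name => vector pairs, with empty metadata everywhere.
function MetaTable(cols::Pair{Symbol,<:AbstractVector}...)
    columns = Dict{Symbol,AbstractVector}(cols...)
    colmeta = Dict{Symbol,Dict{Any,Any}}(name => Dict{Any,Any}() for name in keys(columns))
    return MetaTable(columns, Dict{Any,Any}(), colmeta)
end

# Table-level and column-level getters, falling back to `default`.
metaget(t::MetaTable, key; default=nothing) = get(t.tablemeta, key, default)
metaget(t::MetaTable, column::Symbol, key; default=nothing) =
    get(t.colmeta[column], key, default)

# Table-level and column-level setters.
metaset!(t::MetaTable, key, value) = (t.tablemeta[key] = value; t)
metaset!(t::MetaTable, column::Symbol, key, value) = (t.colmeta[column][key] = value; t)

# Direct access to the underlying dictionaries.
metadict(t::MetaTable) = t.tablemeta
metadict(t::MetaTable, column::Symbol) = t.colmeta[column]

t = MetaTable(:temp => [20.5, 21.0], :site => ["A", "B"])
metaset!(t, "source", "field survey")   # table-level entry
metaset!(t, :temp, "unit", "°C")        # entry for column :temp

source = metaget(t, "source")
unit   = metaget(t, :temp, "unit")
nounit = metaget(t, :site, "unit"; default="n/a")
```

The two-argument and three-argument forms are told apart by arity, so a `Symbol` can still be used as a table-level key in this sketch.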