Metadata for columns and/or DataFrames #35
Comments
Yes, I think this could be useful. At the DataVec level, we will need meta-data for factor-like behavior (#6). And some of the other things you suggest make sense too. On the other hand, we probably want to rely less on arbitrary attributes, like R, and more on types, when there's the possibility of doing so. |
I would love to see this. I wrote about my wish for better support for things like questionnaires, with the code book integrated with the code, here (towards the bottom): http://reganmian.net/blog/2013/10/02/likert-graphs-in-r-embedding-metadata-for-easier-plotting/... Of course, this also raises the issue of serialization. |
If we can make this work without performance degradation, I'm in. |
Standardizing on a few meta-data attributes like variable label and unit would be wonderful. In R, Harrell's Hmisc offers this feature, but unfortunately very few packages use it since it's not standard at all. OTOH, SAS has built-in support for variable labels, which are used e.g. to label tables and plot axes automatically. Stata also has this concept, and even allows associating longer "notes" with variables, to make their meaning explicit. More specialized attributes like question names would be useful, if there was an easy way for a separate package to create and use them. |
Adding units should be trivial, especially if Julia settles on a standard unit package soon. What are the variable labels for: descriptions of the columns to supplement the brief names? |
Yeah, variable labels are just the readable, complete name of the variable, as opposed to the abbreviated form that is practical to type (no spaces, no special characters...) but often cryptic and ugly which is used for variable names. The most typical use of variable labels is when you want to provide a good default for axes labels, like "Annual GDP growth", rather than "GDPG". They could also be useful to describe the contents of a database, with a function like Hmisc's describe() [1] or SAS's proc contents. 1: http://www.inside-r.org/packages/cran/Hmisc/docs/describe |
Yes, I agree with all of this. A long, human-readable Name (which could be LoM could be very handy for statistical modeling routines and the creation On Sun, Oct 6, 2013 at 11:54 AM, Milan Bouchet-Valat <
|
All great ideas. long name, LoM (for example I'd love to indicate that something is a likert-item, which is more specific than just categorical), etc. Not sure what is meant by domain? Units of course useful for measurements. An open ended comment field would be great for code book stuff (how data is collected, coded etc) - I could see some great ways of showing this, especially in the web view. Not sure how this would fit in, but in my R code, I also have the concept of grouping columns - for example having five groups of questions. Also curious about how we serialize this - we can't just spit this out into CSV again. What's DataFrame's "native" format for storing all this metadata? Ideally it would be something that was compatible with other tools as well. HDF5? |
I think separating Likert scales from other categorical variables might be too specific for something as generic as DataFrames: what functions would apply to them that don't apply to other categorical variables? I believe domain is meant in the math sense of "allowable, but not necessarily present, values for entries in this column". We once had grouped columns, but they were dropped because they proved difficult to maintain. They need to be added back in, but it takes a good chunk of work to do. Serialization is kind of a nightmare. I think HDF5 may work, but that's a question for people with more expertise than I have in our current serialization infrastructure. |
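To make the "domain" notion above concrete, here is a minimal sketch. It is written in Python purely for illustration (it is not Julia and not part of any DataFrames API), and the helper name `out_of_domain` and the five-point scale are assumptions made up for the example.

```python
from typing import Hashable, Iterable, List


def out_of_domain(values: Iterable[Hashable], domain: Iterable[Hashable]) -> List[Hashable]:
    """Return the values that fall outside the allowed set, in their original order."""
    allowed = frozenset(domain)
    return [v for v in values if v not in allowed]


# Hypothetical five-point item: the domain lists every allowable value,
# whether or not each one actually occurs in the data.
likert_domain = range(1, 6)
responses = [4, 5, 2, 4]

violations = out_of_domain(responses, likert_domain)
unused = sorted(set(likert_domain) - set(responses))
```

Here `violations` collects entries that break the domain, while `unused` lists allowable values that never occur, which is the "allowable, but not necessarily present" part of the definition.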
I was thinking of LoM as user defined, for example I might want to graph Stian On Mon, Oct 7, 2013 at 11:57 AM, John Myles White
http://reganmian.net/blog -- Random Stuff that Matters |
You could actually do that already: you'd just make a |
I think serialization becomes an important issue - many of these things can On Mon, Oct 7, 2013 at 12:04 PM, John Myles White
|
If you add support for arbitrary meta-data attributes to DataFrames, it will be easy for separate packages to mark some columns as grouped using a group index. No need to hardcode support for every specific feature - just make it easy to extend. |
What would arbitrary metadata consist of? A Dict called metadata that people can do anything with? |
Sure, a Dict containing vectors with one value per column, or even just a DataFrame, since attributes would all have the same length. Only standard attributes would have a pre-specified type, others would be free. Of course setters and getters would make the whole process transparent. |
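The Dict-of-vectors proposal above could look roughly like the following. This is only a sketch under stated assumptions, written in Python for illustration rather than in Julia: the class `MetaTable`, the methods `set_meta` and `get_meta`, and the choice of `label` and `unit` as the typed standard attributes are all hypothetical, not an existing API.

```python
from typing import Any, Dict, List, Optional

# Assumption: standard attributes have a pre-specified type, custom ones are free-form.
STANDARD_ATTRS = {"label": str, "unit": str}


class MetaTable:
    """Toy table whose metadata is a dict of vectors with one slot per column."""

    def __init__(self, columns: Dict[str, List[Any]]):
        self.columns = dict(columns)
        self.names = list(columns)
        # attribute name -> list holding one value (or None) per column
        self.meta: Dict[str, List[Optional[Any]]] = {}

    def set_meta(self, column: str, attr: str, value: Any) -> None:
        """Setter: type-check standard attributes, accept anything for custom ones."""
        expected = STANDARD_ATTRS.get(attr)
        if expected is not None and not isinstance(value, expected):
            raise TypeError(f"{attr!r} must be of type {expected.__name__}")
        slot = self.meta.setdefault(attr, [None] * len(self.names))
        slot[self.names.index(column)] = value

    def get_meta(self, column: str, attr: str, default: Any = None) -> Any:
        """Getter: return the stored value, or `default` if none was set."""
        slot = self.meta.get(attr)
        if slot is None:
            return default
        value = slot[self.names.index(column)]
        return default if value is None else value


table = MetaTable({"GDPG": [1.2, 0.8], "q1": [4, 5], "q2": [2, 3]})
table.set_meta("GDPG", "label", "Annual GDP growth")
table.set_meta("GDPG", "unit", "%")
# A separate package could mark grouped columns with a custom group index.
table.set_meta("q1", "group", 1)
table.set_meta("q2", "group", 1)
```

Because every attribute is a vector of the same length as the column list, the metadata could equally be held in a second table, as suggested above.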
If you're up for making a demo with that approach, it'd be nice to see. My instinct is that trying to avoid pre-specified types is going to make things slow, but I could be wrong. |
I like the idea of metadata, but I'm worried that it complicates things, especially if applied to a DataFrame. As John said, a demo would be a great way to work things out. We once had a concept of column groupings that we eventually pulled out because it tended to complicate things. Trying out an implementation is the best way to judge the balance of additional complexity relative to its benefit. Applying metadata to columns but embedding that data into the DataFrame structure has issues. For example, I may create a DataFrame column that points to a DataArray originally in a different DataFrame like: It's easier to attach metadata to DataArrays or other column data. Then, the metadata goes with columns. Nothing really needs to change in the DataFrame structure. |
I've never really programmed in Julia yet, so I cannot promise anything... Tom's point about storing meta-data directly in DataArrays sounds interesting for attributes that make sense when columns are taken in isolation (i.e. for label, unit...). It would not make much sense for column groupings, since a group index taken alone does not mean much. But that may not be an issue: if you take a column out of its original DataFrame, you know that you're breaking its grouping with other columns. I kind of like this solution: it means the meta-data would be preserved when passing the DataArray directly to a function, which could happen in many cases. |
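The alternative discussed in the last two comments, where metadata lives on the column object rather than in the table, can be sketched as below. Again this is an illustration in Python, not Julia; `Column`, `Table`, and `describe` are hypothetical names, and the toy `Table` deliberately stores no metadata of its own.

```python
from typing import Any, Dict, Iterable


class Column:
    """Toy column: values plus metadata that makes sense for a column in isolation."""

    def __init__(self, values: Iterable[Any], **meta: Any):
        self.values = list(values)
        self.meta = dict(meta)


class Table:
    """Toy table that only maps names to Column objects and holds no metadata."""

    def __init__(self) -> None:
        self.columns: Dict[str, Column] = {}

    def __setitem__(self, name: str, column: Column) -> None:
        self.columns[name] = column

    def __getitem__(self, name: str) -> Column:
        return self.columns[name]


def describe(column: Column) -> str:
    """A function that receives a bare column can still see its label and unit."""
    label = column.meta.get("label", "(no label)")
    unit = column.meta.get("unit")
    return f"{label} [{unit}]" if unit else label


first = Table()
first["GDPG"] = Column([1.2, 0.8], label="Annual GDP growth", unit="%")

# A second table points at the same column object under a different name.
second = Table()
second["growth"] = first["GDPG"]
```

In this layout the label and unit follow the column into `second` and into `describe`, whereas a group index would have nowhere to live, matching the trade-off described above.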
Here's the metadata I'm on board with adding permanently:
Here's the metadata I like, but don't feel comfortable committing to just yet:
FWIW, I'm used to people storing a description of the levels of cryptic enums in the description field of column tables in RDBMS. |
PDAs were originally intended to be a performance/memory optimization, not On Wed, Jan 29, 2014 at 4:22 PM, John Myles White
|
Agreed that PDAs were an optimization, but they've gotten used as factors. I wrote about the limitations of PDAs after "an epiphany" described in JuliaStats/DataArrays.jl#50. Summary: R gets a lot of mileage out of storing information about factor levels in vectors, but that's because each subset (including singleton-elements) retains information about the vector as a whole. Since Julia has proper scalars, factors need to be represented using a new scalar type, which will probably end up looking like Enums. |
I don't get why we would need a Starting with column labels and not supporting units is reasonable. The essential point is to make the system extendable so that new attributes can be added in the future (custom attributes too?). Finally, there's the question of whether some meta-data should be stored in |
Regarding "Nullable", can't you just use As far as what we store in metadata, maybe we can use a Dict for that to allow storing different fields, and standardize on a few common names. Regarding where to store the metadata, in this thread above, I outlined adding metadata to the columns. That helps with So, if we stick metadata in the DataFrame (or Index), we might need a structure that carries the data and metadata to handle the |
We don't need to store a nullable attribute. We just need to expose that information through an interface. But it might be faster to check a BitVector than to check the type tag of each column. Let's worry about implementation later and focus on design first. I'm not really ready to embrace custom attributes just yet, since it fragments the community if some people's DataFrames have properties that other DataFrames don't share. Let's think about whether we should have them later. For factors, the levels of the factor will be stored in the type system, not in a DataArray. |
I don't think we should store metadata in columns. AFAIK, RDBMS systems don't do that: these properties are attached to the specific table and don't come along for the ride with the values in that table. So I'd rather require the user to do |
Sounds good, John. |
I'm not very familiar with database management systems, but it seems to me it would be convenient and completely logical to preserve column labels if you copy a column to another A more general issue I'm thinking about is that if meta-data is attached to the |
I think people should always specify their axes labels manually if they don't want defaults. |
Of course, I don't deny that. I was talking about the impossibility for |
That's true. But we'll never offer anything as smooth as R's kind of defaults, where you have access to information about the calling context. I think people can get used to explicitness. |
With column labels, we can actually offer something much more useful than R's defaults. In R most of the time the default axis label is ugly or useless, e.g. |
We can use column labels for interfaces that take in DataFrames as arguments in the way that Gadfly does. For things that work with vectors, they should not assume labels will exist. Otherwise they're broken for normal Arrays. |
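The split described above, between table-based interfaces that may use labels and vector-based ones that must not assume them, might be sketched like this. It is a Python illustration only; `axis_label`, `plot_spec_from_table`, and `plot_spec_from_vectors` are hypothetical helpers, not Gadfly or DataFrames functions, and they return a plain dict instead of drawing anything.

```python
from typing import Dict, Optional, Sequence


def axis_label(labels: Dict[str, str], column: str) -> str:
    """Use the column label when one exists, otherwise fall back to the column name."""
    return labels.get(column, column)


def plot_spec_from_table(names: Sequence[str], labels: Dict[str, str], x: str, y: str) -> Dict[str, str]:
    """Table-based interface: default axis titles can come from column labels."""
    for name in (x, y):
        if name not in names:
            raise KeyError(name)
    return {"xlabel": axis_label(labels, x), "ylabel": axis_label(labels, y)}


def plot_spec_from_vectors(
    x: Sequence[float],
    y: Sequence[float],
    xlabel: Optional[str] = None,
    ylabel: Optional[str] = None,
) -> Dict[str, str]:
    """Vector-based interface: works on plain sequences, so titles must be passed explicitly."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return {"xlabel": xlabel or "", "ylabel": ylabel or ""}


names = ["year", "GDPG"]
labels = {"GDPG": "Annual GDP growth"}
from_table = plot_spec_from_table(names, labels, "year", "GDPG")
from_vectors = plot_spec_from_vectors([2012, 2013], [1.2, 0.8], ylabel="Growth")
```

The vector version never looks for labels, so it behaves the same for plain arrays and for labelled columns.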
Do we really want arbitrary metadata in the type? Seems like there are generally other ways to include metadata about your DataFrame w/o stuffing it into the type itself. |
I think something like that would be useful, even if that's not the highest priority. How could you store metadata about columns without support in |
I agree with the above. If the goal is to have easy plotting (automatic labels), and easy table creation, forcing a long list of packages to interact with a third labeling package would be far more difficult to maintain than incorporating metadata into dataframes. |
Closed with #3055 |
Should we leave room for metadata on structures? Frank Harrell's Hmisc package allows units and labels to be attached to data.frame columns.
People may want to attach other metadata like experimenter name or a DataFrame comment.
We could add a meta Dict to the DataFrame, the colindex, and/or at the DataVec level.