Metadata for columns and/or DataFrames #35

tshort · 2012-07-19T14:46:58Z

Should we leave room for metadata on structures? Frank Harrell's Hmisc package allows units and labels to be attached to data.frame columns.

People may want to attach other metadata like experimenter name or a DataFrame comment.

We could add a meta Dict to the DataFrame, the colindex, and/or at the DataVec level.

HarlanH · 2012-07-19T14:57:25Z

Yes, I think this could be useful. At the DataVec level, we will need meta-data for factor-like behavior (#6). And some of the other things you suggest make sense too. On the other hand, we probably want to rely less on arbitrary attributes, like R, and more on types, when there's the possibility of doing so.

houshuang · 2013-10-04T16:51:55Z

I would love to see this. I wrote about my wish for better support for things like questionnaires, with the code book integrated with the code, here (towards the bottom): http://reganmian.net/blog/2013/10/02/likert-graphs-in-r-embedding-metadata-for-easier-plotting/...

Of course, this also raises the issue of serialization.

johnmyleswhite · 2013-10-05T14:40:24Z

If we can make this work without performance degradation, I'm in.

nalimilan · 2013-10-06T15:44:36Z

Standardizing on a few meta-data attributes like variable label and unit would be wonderful. In R, Harrell's Hmisc offers this feature, but unfortunately very few packages use it since it's not standard at all. By contrast, SAS has built-in support for variable labels, which are used e.g. to label tables and plot axes automatically. Stata also has this concept, and even allows associating longer "notes" with variables, to make their meaning explicit.

More specialized attributes like question names would be useful, if there was an easy way for a separate package to create and use them.

johnmyleswhite · 2013-10-06T15:47:00Z

Adding units should be trivial, especially if Julia settles on a standard unit package soon. What are the variable labels for: descriptions of the columns to supplement the brief names?

nalimilan · 2013-10-06T15:54:20Z

Yeah, variable labels are just the readable, complete name of the variable, as opposed to the variable name itself, which is an abbreviated form that is practical to type (no spaces, no special characters...) but often cryptic and ugly. The most typical use of variable labels is when you want to provide a good default for axis labels, like "Annual GDP growth", rather than "GDPG". They could also be useful to describe the contents of a database, with a function like Hmisc's describe() [1] or SAS's proc contents.

1: http://www.inside-r.org/packages/cran/Hmisc/docs/describe

HarlanH · 2013-10-06T16:01:50Z

Yes, I agree with all of this. A long, human-readable Name (which could be
leveraged for axes labels by plotting routines), Units, and
Level-of-measurement would be very helpful. Possibly also Domain.

LoM could be very handy for statistical modeling routines and the creation
of appropriate model matrices (or the throwing of warnings). I never
intended PooledDataVector to be equivalent to Factor -- it's a
representational optimization. Would be much better for statistical
routines to look for Nominal or Ordinal types and act appropriately, even
if the underlying type is a non-pooled integer or string.

houshuang · 2013-10-07T15:54:13Z

All great ideas: long name, LoM (for example, I'd love to indicate that something is a Likert item, which is more specific than just categorical), etc. Not sure what is meant by domain? Units are of course useful for measurements. An open-ended comment field would be great for code book stuff (how data is collected, coded, etc.). I could see some great ways of showing this, especially in the web view.

Not sure how this would fit in, but in my R code, I also have the concept of grouping columns - for example having five groups of questions.

Also curious about how we serialize this - we can't just spit this out into CSV again. What's DataFrame's "native" format for storing all this metadata? Ideally it would be something that was compatible with other tools as well. HDF5?

johnmyleswhite · 2013-10-07T15:57:53Z

I think separating Likert scales from other categorical variables might be too specific for something as generic as DataFrames: what functions would apply to them that don't apply to other categorical variables?

I believe domain is meant in the math sense of "allowable, but not necessarily present, values for entries in this column".

We once had grouped columns, but they were dropped because they proved difficult to maintain. They need to be added back in, but it takes a good chunk of work to do.

Serialization is kind of a nightmare. I think HDF5 may work, but that's a question for people with more expertise than I have in our current serialization infrastructure.

houshuang · 2013-10-07T16:01:50Z

I was thinking of LoM as user-defined; for example, I might want to graph
Likert scales differently from a demographic categorical variable... But
this isn't super important.

Stian

johnmyleswhite · 2013-10-07T16:04:42Z

You could actually do that already: you'd just make a DataArray{LikertResponse}, where LikertResponse is a custom type. This is one of the virtues of our approach to NA: you can create a DataArray for any type in Julia, not just those we've built into the system.
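
A minimal sketch of the custom-type idea, under stated assumptions: `LikertResponse` and `chart_kind` are hypothetical names, the 1 to 5 scale is an illustrative choice, and Base's `missing` with a `Union{Missing, T}` element type stands in for the NA-aware DataArray discussed here.

```julia
# Hypothetical scalar type for a 5-point Likert item (the 1:5 range is an assumption).
struct LikertResponse
    value::Int
    function LikertResponse(v::Integer)
        1 <= v <= 5 || throw(ArgumentError("Likert response must be in 1:5, got $v"))
        new(v)
    end
end

# A column of Likert responses that can also hold missing entries.
responses = Union{Missing, LikertResponse}[LikertResponse(4), missing, LikertResponse(2)]

# Plotting-style code can dispatch on the element type instead of reading attributes.
chart_kind(::AbstractVector{<:Union{Missing, LikertResponse}}) = "diverging stacked bars"
chart_kind(::AbstractVector) = "plain bars"

println(chart_kind(responses))   # Likert-specific method is selected
println(chart_kind([1, 2, 3]))   # generic fallback
```

The point of the sketch is that the distinction lives in the type, so no extra metadata field is needed to tell a Likert column from any other categorical column.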

houshuang · 2013-10-07T16:09:08Z

I think serialization becomes an important issue. Many of these things can
probably be done already by subclassing DataFrame etc. (and whether it's
better to extend DataFrame or subclass it becomes a design question).
However, the key question is how I can set up my data the way I want it (with
full names, groups, etc.), and then store it for future analysis by other
scripts...

nalimilan · 2013-10-07T16:09:36Z

If you add support for arbitrary meta-data attributes to DataFrames, it will be easy for separate packages to mark some columns as grouped using a group index. No need to hardcode support for every specific feature - just make it easy to extend.

johnmyleswhite · 2013-10-07T17:43:24Z

What would arbitrary metadata consist of? A Dict called metadata that people can do anything with?

nalimilan · 2013-10-07T19:36:57Z

Sure, a Dict containing vectors with one value per column, or even just a DataFrame, since attributes would all have the same length. Only standard attributes would have a pre-specified type, others would be free. Of course setters and getters would make the whole process transparent.
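
One way this proposal could look, as a rough sketch rather than an agreed design: every name below (`ColumnMeta`, `setlabel!`, `getlabel`, `setmeta!`, `getmeta`) is hypothetical. The standard attribute (the label) gets a fixed type, custom attributes live in a Dict of vectors with one slot per column, and `nothing` marks "not set".

```julia
# Hypothetical per-column metadata store: each attribute holds one value per column.
struct ColumnMeta
    names::Vector{Symbol}
    label::Vector{String}               # standard attribute with a pre-specified type
    custom::Dict{Symbol,Vector{Any}}    # free-form attributes, one value per column
end

ColumnMeta(names::Vector{Symbol}) =
    ColumnMeta(names, fill("", length(names)), Dict{Symbol,Vector{Any}}())

# Position of a column, or a KeyError if it does not exist.
function colpos(m::ColumnMeta, col::Symbol)
    i = findfirst(==(col), m.names)
    i === nothing && throw(KeyError(col))
    return i
end

# Setters and getters hide the storage layout from callers.
setlabel!(m::ColumnMeta, col::Symbol, text::String) = (m.label[colpos(m, col)] = text; m)
getlabel(m::ColumnMeta, col::Symbol) = m.label[colpos(m, col)]

function setmeta!(m::ColumnMeta, col::Symbol, attr::Symbol, value)
    # Create the attribute on first use, filled with `nothing` for every column.
    values = get!(() -> Any[nothing for _ in m.names], m.custom, attr)
    values[colpos(m, col)] = value
    return m
end
getmeta(m::ColumnMeta, col::Symbol, attr::Symbol) = m.custom[attr][colpos(m, col)]

meta = ColumnMeta([:gdpg, :year])
setlabel!(meta, :gdpg, "Annual GDP growth")
setmeta!(meta, :gdpg, :group, 1)        # e.g. a group index set by a separate package
```

The `Vector{Any}` storage for custom attributes is where the performance concern would apply; the typed `label` vector avoids it for the standard attribute.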

johnmyleswhite · 2013-10-08T00:48:24Z

If you're up for making a demo with that approach, it'd be nice to see. My instinct is that trying to avoid pre-specified types is going to make things slow, but I could be wrong.

tshort · 2013-10-08T01:05:29Z

I like the idea of metadata, but I'm worried that it complicates things, especially if applied to a DataFrame. As John said, a demo would be a great way to work things out. We once had a concept of column groupings that we eventually pulled out because it tended to complicate things. Trying out an implementation is the best way to judge the balance of additional complexity relative to its benefit.

Applying metadata to columns but embedding that data into the DataFrame structure has issues. For example, I may create a DataFrame column that points to a DataArray originally in a different DataFrame like: df1["colX"] = df2["colY"]. If df2 had column labels or other metadata, it would be lost because the DataArray df2["colY"] doesn't know about that. This type of column reuse is common in DataFrames.

It's easier to attach metadata to DataArrays or other column data. Then, the metadata goes with columns. Nothing really needs to change in the DataFrame structure.
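
To make the column-reuse point concrete, here is a small sketch under stated assumptions: `MetaVector` is a hypothetical wrapper type (not an existing DataArrays type), and the two "data frames" are modeled as plain Dicts from column name to column.

```julia
# Hypothetical wrapper: a column that carries its own metadata.
struct MetaVector{T} <: AbstractVector{T}
    data::Vector{T}
    meta::Dict{Symbol,Any}
end

# Minimal AbstractVector interface so the wrapper behaves like its data.
Base.size(v::MetaVector) = size(v.data)
Base.getindex(v::MetaVector, i::Int) = v.data[i]
Base.IndexStyle(::Type{<:MetaVector}) = IndexLinear()

# Two stand-in "data frames": Dicts mapping column names to columns.
df2 = Dict{String,AbstractVector}(
    "colY" => MetaVector([1.5, 2.0, 3.5], Dict{Symbol,Any}(:label => "Height", :unit => "m")),
)
df1 = Dict{String,AbstractVector}()

# Column reuse: the same object is shared, so its metadata travels with it.
df1["colX"] = df2["colY"]
println(df1["colX"].meta[:label])

# Caveat: a derived column is a plain Vector here, so the metadata is not carried over.
shifted = df2["colY"] .+ 1
println(shifted isa MetaVector)
```

In this sketch nothing in the container has to change, but any operation that builds a new array would need its own rule for what happens to the metadata.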

nalimilan · 2013-10-08T07:51:11Z

I've never really programmed in Julia yet, so I cannot promise anything...

Tom's point about storing meta-data directly in DataArrays sounds interesting for attributes that make sense when columns are taken in isolation (i.e. for label, unit...). It would not make much sense for column groupings, since a group index taken alone does not mean much. But that may not be an issue: if you take a column out of its original DataFrame, you know that you're breaking its grouping with other columns.

I kind of like this solution: it means the meta-data would be preserved when passing the DataArray directly to a function, which could happen in many cases.

johnmyleswhite · 2014-01-29T21:22:37Z

Here's the metadata I'm on board with adding permanently:

Nullable: Is this column a Vector or a DataVector? (Note that, if we make the changes described in a recent discussion regarding problems with PDA's never being able to capture all properties of categorical data, we'll only have Vector or DataVector going forward.)
Column label/description: An arbitrary-length string describing the contents of that column in natural language.

Here's the metadata I like, but don't feel comfortable committing to just yet:

Units of measurement: Saying whether a vector is measured in inches or feet or meters seems really awesome, but it seems like it might be done rarely enough that I'm not ready to commit to it just yet. Let's shoot for working this idea out for after the 0.3 release.

FWIW, I'm used to people storing a description of the levels of cryptic enums in the description field of column tables in RDBMS.

HarlanH · 2014-01-29T21:26:13Z

PDAs were originally intended to be a performance/memory optimization, not
(just) a representation for categorical data. I missed the discussion of
their limitations -- would you point me at that?

johnmyleswhite · 2014-01-29T21:29:33Z

Agreed that PDA's were an optimization, but they've gotten used as factors.

I wrote about the limitations of PDA's after "an epiphany" described in JuliaStats/DataArrays.jl#50.

Summary: R gets a lot of mileage out of storing information about factor levels in vectors, but that's because each subset (including singleton-elements) retains information about the vector as a whole. Since Julia has proper scalars, factors need to be represented using a new scalar type, which will probably end up looking like Enum's.

nalimilan · 2014-01-29T21:45:02Z

I don't get why we would need a Nullable attribute: shouldn't this be inferred from the type of the column vector (i.e. Array or DataArray)?

Starting with column labels and not supporting units is reasonable. The essential point is to make the system extendable so that new attributes can be added in the future (custom attributes too?).

Finally, there's the question of whether some meta-data should be stored in DataArrays directly. For factors, the levels will have to. Conceptually, a variable label is also attached to the column rather than to the DataFrame. The problem is that standard Arrays do not support meta-data.

tshort · 2014-01-29T21:49:16Z

Regarding "Nullable", can't you just use colwise and extract that from the column type? Arrays can't have missing data and DataArrays can. Actually, Arrays could have missing data if the Array holds a type that can be an NA. In any case, you should still be able to tell by the type of Array{T,N} using T.
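
A sketch of reading nullability off the element type, with assumptions stated up front: `allowsmissing` is a hypothetical helper, the table is modeled as a Dict of columns, and Base's `missing` with `Union{Missing, T}` stands in for NA and DataArray.

```julia
# A column can hold missing values exactly when Missing is allowed by its element type.
allowsmissing(col::AbstractVector) = Missing <: eltype(col)

# Stand-in for a data frame: column name => column.
columns = Dict(
    :age    => [31, 45, 27],                                         # Vector{Int}
    :income => Union{Missing,Float64}[52_000.0, missing, 48_500.0],  # allows missing
)

# Derived on demand from the types, with nothing stored as metadata.
nullable = Dict(name => allowsmissing(col) for (name, col) in columns)
println(nullable)
```

Because the answer comes from `eltype`, it cannot drift out of sync with the column the way a separately stored flag could.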

As far as what we store in metadata, maybe we can use a Dict for that to allow storing different fields, and standardize on a few common names.

Regarding where to store the metadata, in this thread above, I outlined adding metadata to the columns. That helps with df[:newcol] = df2[:othercol]. But, what do you do with df[:col] + 1?

So, if we stick metadata in the DataFrame (or Index), we might need a structure that carries the data and metadata to handle the df[:newcol] = df2[:othercol] case. Or, we can just require the user to do metadata(df)[:newcol] = metadata(df2)[:othercol].

johnmyleswhite · 2014-01-29T21:49:56Z

We don't need to store a nullable attribute. We just need to expose that information through an interface. But it might be faster to check a BitVector than to check the type tag of each column. Let's worry about implementation later and focus on design first.

I'm not really ready to embrace custom attributes just yet, since it fragments the community if some people's DataFrames have properties that other DataFrames don't share. Let's think about whether we should have them later.

For factors, the levels of the factor will be stored in the type system, not in a DataArray.

johnmyleswhite · 2014-01-29T21:52:49Z

I don't think we should store metadata in columns. AFAIK, RDBMS systems don't do that: these properties are attached to the specific table and don't come along for the ride with the values in that table. So I'd rather require the user to do metadata(df)[:newcol] = metadata(df2)[:othercol].
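
A sketch of what table-level storage with explicit copying could look like. `LabeledTable` is a hypothetical stand-in for a DataFrame, and the `metadata` function here is defined locally for illustration, following the spelling used in this thread; it is not presented as an existing API.

```julia
# Hypothetical table: labels belong to the table, not to the column vectors.
struct LabeledTable
    columns::Dict{Symbol,AbstractVector}
    labels::Dict{Symbol,String}
end
LabeledTable() = LabeledTable(Dict{Symbol,AbstractVector}(), Dict{Symbol,String}())

# Accessor in the spirit of `metadata(df)` as written above.
metadata(t::LabeledTable) = t.labels

df2 = LabeledTable()
df2.columns[:othercol] = [1.2, 3.4]
metadata(df2)[:othercol] = "Annual GDP growth"

df = LabeledTable()
df.columns[:newcol] = df2.columns[:othercol]       # copies the data reference only
println(haskey(metadata(df), :newcol))             # the label did not come along

metadata(df)[:newcol] = metadata(df2)[:othercol]   # explicit, user-driven copy
println(metadata(df)[:newcol])
```

The trade-off shown is the one under discussion: columns stay plain vectors, at the cost of one extra line whenever a label should follow a column into another table.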

tshort · 2014-01-29T21:53:49Z

Sounds good, John.

nalimilan · 2014-01-30T22:02:56Z

I'm not very familiar with database management systems, but it seems to me it would be convenient and completely logical to preserve column labels if you copy a column to another DataFrame, which is what df[:newcol] = df2[:othercol] is about. That said, a special function to copy a column could also be added if needed, which would handle this special case.

A more general issue I'm thinking about is that if meta-data is attached to the DataFrame and not the vector, then an (imaginary) call like plot(df[:col1], df[:col2]) will not be able to access the column labels to find meaningful default axis labels. An interface dedicated to DataFrames will have to be used, something like plot(~ col1 + col2, df). This sounds fine to me (and even better than the first form), but it's worth checking it would work in all common cases.

johnmyleswhite · 2014-01-30T22:04:12Z

I think people should always specify their axes labels manually if they don't want defaults.

nalimilan · 2014-01-30T22:09:30Z

Of course, I don't deny that. I was talking about the impossibility for plot(df[:col1], df[:col2]) to offer reasonable defaults when the user does not set them explicitly.

johnmyleswhite · 2014-01-30T22:29:12Z

That's true. But we'll never offer anything as smooth as R's kind of defaults, where you have access to information about the calling context. I think people can get used to explicitness.

nalimilan · 2014-01-31T10:36:34Z

With column labels, we can actually offer something much more useful than R's defaults. In R most of the time the default axis label is ugly or useless, e.g. df[["datebrth"]] or even x[[3]]. With DataFrames it could be Date of birth instead.

johnmyleswhite · 2014-01-31T16:59:13Z

We can use column labels for interfaces that take in DataFrames as arguments in the way that Gadfly does. For things that work with vectors, they should not assume labels will exist. Otherwise they're broken for normal Arrays.
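
The split between the two kinds of interface can be sketched as follows. All names are hypothetical (`LabeledFrame`, `axis_label`, `plot_title`), and the "plot" is reduced to building a title string so the example stays self-contained.

```julia
# Hypothetical table with optional per-column labels.
struct LabeledFrame
    columns::Dict{Symbol,Vector{Float64}}
    labels::Dict{Symbol,String}
end

# Fall back to the column name when no label has been set.
axis_label(t::LabeledFrame, col::Symbol) = get(t.labels, col, string(col))

# Vector interface: assumes no labels, so it also works for ordinary Arrays.
plot_title(x::AbstractVector, y::AbstractVector; xlabel="x", ylabel="y") =
    "$ylabel vs. $xlabel"

# Table interface: looks the labels up and forwards them as defaults.
plot_title(t::LabeledFrame, xcol::Symbol, ycol::Symbol) =
    plot_title(t.columns[xcol], t.columns[ycol];
               xlabel=axis_label(t, xcol), ylabel=axis_label(t, ycol))

t = LabeledFrame(
    Dict(:datebrth => [1980.0, 1992.0], :income => [41.0, 38.5]),
    Dict(:datebrth => "Date of birth"),
)
println(plot_title(t, :datebrth, :income))
println(plot_title([1, 2], [3, 4]))
```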

quinnj · 2017-09-07T03:19:23Z

Do we really want arbitrary metadata in the type? Seems like there are generally other ways to include metadata about your DataFrame w/o stuffing it into the type itself.

nalimilan · 2017-09-07T13:02:07Z

I think something like that would be useful, even if that's not the highest priority. How could you store metadata about columns without support in DataFrame itself?

pdeffebach · 2017-09-18T03:14:18Z

I agree with the above. If the goal is to have easy plotting (automatic labels), and easy table creation, forcing a long list of packages to interact with a third labeling package would be far more difficult to maintain than incorporating metadata into dataframes.

bkamins · 2022-09-20T07:44:56Z

Closed with #3055

HarlanH mentioned this issue Jan 29, 2014

Convert to using only symbols for column names. #509

Merged

nalimilan mentioned this issue Jun 9, 2014

why #618

Closed

nalimilan mentioned this issue May 27, 2018

Add metadata support to DataFrames #1413

Closed

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020

bkamins added this to the 2.0 milestone Feb 12, 2020

bkamins modified the milestones: 1.x, 1.4 Dec 10, 2021

bkamins mentioned this issue Dec 11, 2021

Add metadata #2961

Closed

bkamins added metadata and removed decision non-breaking The proposed change is not breaking labels May 22, 2022

bkamins mentioned this issue May 22, 2022

Metadata on data frame and column level #3055

Merged

nalimilan pushed a commit that referenced this issue May 26, 2022

Support R Dates and POSIXct (#35)

2de528a

closes #34 use TimeZones to support R's POSIXct (some non-IANA timezone codes generated by R not supported)

bkamins closed this as completed Sep 20, 2022
