
Metadata for columns and/or DataFrames #35

Closed
tshort opened this issue Jul 19, 2012 · 36 comments

Comments

@tshort
Contributor

tshort commented Jul 19, 2012

Should we leave room for metadata on structures? Frank Harrell's Hmisc package allows units and labels to be attached to data.frame columns.

People may want to attach other metadata like experimenter name or a DataFrame comment.

We could add a meta Dict to the DataFrame, the colindex, and/or at the DataVec level.

@HarlanH
Contributor

HarlanH commented Jul 19, 2012

Yes, I think this could be useful. At the DataVec level, we will need meta-data for factor-like behavior (#6). And some of the other things you suggest make sense too. On the other hand, we probably want to rely less on arbitrary attributes, like R, and more on types, when there's the possibility of doing so.

@houshuang

I would love to see this. I wrote about my wish for better support for things like questionnaires, with the code book integrated with the code, here (towards the bottom): http://reganmian.net/blog/2013/10/02/likert-graphs-in-r-embedding-metadata-for-easier-plotting/

Of course, this also raises the issue of serialization.

@johnmyleswhite
Contributor

If we can make this work without performance degradation, I'm in.

@nalimilan
Member

Standardizing on a few meta-data attributes like variable label and unit would be wonderful. In R, Harrell's Hmisc offers this feature, but unfortunately very few packages use it since it's not standard at all. OTOH, SAS has built-in support for variable labels, which are used e.g. to label tables and plot axes automatically. Stata also has this concept, and even allows associating longer "notes" with variables to explain their meaning.

More specialized attributes like question names would be useful, if there was an easy way for a separate package to create and use them.

@johnmyleswhite
Contributor

Adding units should be trivial, especially if Julia settles on a standard unit package soon. What are the variable labels for: descriptions of the columns to supplement the brief names?

@nalimilan
Member

Yeah, variable labels are just the readable, complete name of the variable, as opposed to the abbreviated form used for variable names, which is practical to type (no spaces, no special characters...) but often cryptic and ugly. The most typical use of variable labels is when you want to provide a good default for axis labels, like "Annual GDP growth" rather than "GDPG". They could also be useful to describe the contents of a database, with a function like Hmisc's describe() [1] or SAS's proc contents.

1: http://www.inside-r.org/packages/cran/Hmisc/docs/describe

@HarlanH
Contributor

HarlanH commented Oct 6, 2013

Yes, I agree with all of this. A long, human-readable Name (which could be leveraged for axes labels by plotting routines), Units, and Level-of-measurement would be very helpful. Possibly also Domain.

LoM could be very handy for statistical modeling routines and the creation of appropriate model matrices (or the throwing of warnings). I never intended PooledDataVector to be equivalent to Factor -- it's a representational optimization. Would be much better for statistical routines to look for Nominal or Ordinal types and act appropriately, even if the underlying type is a non-pooled integer or string.


@houshuang

All great ideas: long name, LoM (for example, I'd love to indicate that something is a Likert item, which is more specific than just categorical), etc. Not sure what is meant by domain? Units are of course useful for measurements. An open-ended comment field would be great for code-book stuff (how data is collected, coded, etc.) - I could see some great ways of showing this, especially in the web view.

Not sure how this would fit in, but in my R code, I also have the concept of grouping columns - for example having five groups of questions.

Also curious about how we serialize this - we can't just spit this out into CSV again. What's DataFrame's "native" format for storing all this metadata? Ideally it would be something that was compatible with other tools as well. HDF5?

@johnmyleswhite
Contributor

I think separating Likert scales from other categorical variables might be too specific for something as generic as DataFrames: what functions would apply to them that don't apply to other categorical variables?

I believe domain is meant in the math sense of "allowable, but not necessarily present, values for entries in this column".

We once had grouped columns, but they were dropped because they proved difficult to maintain. They would need to be added back in, which would take a good chunk of work.

Serialization is kind of a nightmare. I think HDF5 may work, but that's a question for people with more expertise than I have in our current serialization infrastructure.

@houshuang

I was thinking of LoM as user defined - for example, I might want to graph Likert scales differently from a demographic categorical variable... But this isn't super-important.


@johnmyleswhite
Contributor

You could actually do that already: you'd just make a DataArray{LikertResponse}, where LikertResponse is a custom type. This is one of the virtues of our approach to NA: you can create a DataArray for any type in Julia, not just those we've built into the system.
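As a sketch of the custom-type approach John describes (`LikertResponse` is a hypothetical illustration, not a type from any package):

```julia
# Hypothetical sketch: a custom scalar type for Likert responses that any
# Julia array (or DataArray) could hold, carrying semantics in the type itself.
struct LikertResponse
    value::Int  # 1 = "strongly disagree" … 5 = "strongly agree"
    function LikertResponse(v::Integer)
        1 <= v <= 5 || throw(ArgumentError("Likert value must be in 1:5"))
        new(v)
    end
end

responses = [LikertResponse(4), LikertResponse(2), LikertResponse(5)]
# Plotting or modeling code can now dispatch on eltype(responses).
mean_score = sum(r.value for r in responses) / length(responses)
```

Because dispatch happens on the element type, plotting routines could choose a Likert-specific visualization without DataFrames itself knowing anything about Likert scales.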

@houshuang

I think serialization becomes an important issue - many of these things can probably be done already by subclassing DataFrame etc. (and whether it's better to extend DataFrame or subclass it becomes a design question); however, the key question is how I can set up my data the way I want it (with full names, groups, etc.) and then store it for future analysis by other scripts...


@nalimilan
Member

If you add support for arbitrary meta-data attributes to DataFrames, it will be easy for separate packages to mark some columns as grouped using a group index. No need to hardcode support for every specific feature - just make it easy to extend.

@johnmyleswhite
Contributor

What would arbitrary metadata consist of? A Dict called metadata that people can do anything with?

@nalimilan
Member

Sure, a Dict containing vectors with one value per column, or even just a DataFrame, since attributes would all have the same length. Only standard attributes would have a pre-specified type, others would be free. Of course setters and getters would make the whole process transparent.
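A minimal sketch of what such a Dict-plus-accessors design could look like; the helper names (`ColumnMetadata`, `label!`, `label`) are illustrative, not an existing API:

```julia
# Hypothetical sketch: per-column metadata in a Dict, with transparent
# getters/setters so callers never touch the Dict directly.
struct ColumnMetadata
    attrs::Dict{Symbol,Dict{Symbol,Any}}  # column name => attribute => value
end
ColumnMetadata() = ColumnMetadata(Dict{Symbol,Dict{Symbol,Any}}())

label!(m::ColumnMetadata, col::Symbol, text::AbstractString) =
    get!(m.attrs, col, Dict{Symbol,Any}())[:label] = text
# Fall back to the column name when no label has been set.
label(m::ColumnMetadata, col::Symbol) =
    get(get(m.attrs, col, Dict{Symbol,Any}()), :label, string(col))

m = ColumnMetadata()
label!(m, :gdpg, "Annual GDP growth")
label(m, :gdpg)   # "Annual GDP growth"
label(m, :other)  # no label set, falls back to "other"
```

Only a few standard attributes (like `:label`) would have a pre-specified meaning; anything else in the inner Dict would be free-form.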

@johnmyleswhite
Contributor

If you're up for making a demo with that approach, it'd be nice to see. My instinct is that trying to avoid pre-specified types is going to make things slow, but I could be wrong.

@tshort
Contributor Author

tshort commented Oct 8, 2013

I like the idea of metadata, but I'm worried that it complicates things, especially if applied to a DataFrame. As John said, a demo would be a great way to work things out. We once had a concept of column groupings that we eventually pulled out because it tended to complicate things. Trying out an implementation is the best way to judge the balance of additional complexity relative to its benefit.

Applying metadata to columns but embedding that data into the DataFrame structure has issues. For example, I may create a DataFrame column that points to a DataArray originally in a different DataFrame like: df1["colX"] = df2["colY"]. If df2 had column labels or other metadata, it would be lost because the DataArray df2["colY"] doesn't know about that. This type of column reuse is common in DataFrames.

It's easier to attach metadata to DataArrays or other column data. Then, the metadata goes with columns. Nothing really needs to change in the DataFrame structure.
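A rough sketch of that idea - metadata carried by the column itself, so it travels when the column is reused - using a hypothetical wrapper type (not an actual DataFrames type):

```julia
# Hypothetical sketch: a thin AbstractVector wrapper that carries a label
# alongside the values, so df1["colX"] = df2["colY"] keeps the metadata.
struct LabeledVector{T} <: AbstractVector{T}
    data::Vector{T}
    label::String
end
Base.size(v::LabeledVector) = size(v.data)
Base.getindex(v::LabeledVector, i::Int) = v.data[i]
Base.IndexStyle(::Type{<:LabeledVector}) = IndexLinear()

col = LabeledVector([170, 185, 162], "Height (cm)")
total = sum(col)  # behaves like a normal vector
col.label         # "Height (cm)" - metadata rides along with the column
```

The open question raised below still applies: what label should `col .+ 1` carry?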

@nalimilan
Member

I've never really programmed in Julia yet, so I cannot promise anything...

Tom's point about storing meta-data directly in DataArrays sounds interesting for attributes that make sense when columns are taken in isolation (i.e. for label, unit...). It would not make much sense for column groupings, since a group index taken alone does not mean much. But that may not be an issue: if you take a column out of its original DataFrame, you know that you're breaking its grouping with other columns.

I kind of like this solution: it means the meta-data would be preserved when passing the DataArray directly to a function, which could happen in many cases.

@johnmyleswhite
Contributor

Here's the metadata I'm on board with adding permanently:

  • Nullable: Is this column a Vector or a DataVector? (Note that, if we make the changes described in a recent discussion regarding problems with PDA's never being able to capture all properties of categorical data, we'll only have Vector or DataVector going forward.)
  • Column label/description: An arbitrary-length string describing the contents of that column in natural language.

Here's the metadata I like, but don't feel comfortable committing to just yet:

  • Units of measurement: Saying whether a vector is measured in inches or feet or meters seems really awesome, but it seems like it might be done rarely enough that I'm not ready to commit to it just yet. Let's plan to work this idea out after the 0.3 release.

FWIW, I'm used to people storing a description of the levels of cryptic enums in the description field of column tables in RDBMS.

@HarlanH
Contributor

HarlanH commented Jan 29, 2014

PDAs were originally intended to be a performance/memory optimization, not (just) a representation for categorical data. I missed the discussion of their limitations -- would you point me at that?


@johnmyleswhite
Contributor

Agreed that PDAs were an optimization, but they've gotten used as factors.

I wrote about the limitations of PDAs after "an epiphany" described in JuliaStats/DataArrays.jl#50.

Summary: R gets a lot of mileage out of storing information about factor levels in vectors, but that's because each subset (including singleton elements) retains information about the vector as a whole. Since Julia has proper scalars, factors need to be represented using a new scalar type, which will probably end up looking like Enums.

@nalimilan
Member

I don't get why we would need a Nullable attribute: shouldn't this be inferred from the type of the column vector (i.e. Array or DataArray)?

Starting with column labels and not supporting units is reasonable. The essential point is to make the system extensible so that new attributes can be added in the future (custom attributes too?).

Finally, there's the question of whether some meta-data should be stored in DataArrays directly. For factors, the levels will have to. Conceptually, a variable label is also attached to the column rather than to the DataFrame. The problem is that standard Arrays do not support meta-data.

@tshort
Contributor Author

tshort commented Jan 29, 2014

Regarding "Nullable", can't you just use colwise and extract that from the column type? Arrays can't have missing data and DataArrays can. Actually, Arrays could have missing data if the Array holds a type that can be an NA. In any case, you should still be able to tell from the element type T of Array{T,N}.

As far as what we store in metadata, maybe we can use a Dict for that to allow storing different fields, and standardize on a few common names.

Regarding where to store the metadata, in this thread above, I outlined adding metadata to the columns. That helps with df[:newcol] = df2[:othercol]. But, what do you do with df[:col] + 1?

So, if we stick metadata in the DataFrame (or Index), we might need a structure that carries the data and metadata to handle the df[:newcol] = df2[:othercol] case. Or, we can just require the user to do metadata(df)[:newcol] = metadata(df2)[:othercol].
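To make the trade-off concrete, here is a toy sketch of the "metadata beside the table" approach, where copying a column's values does not copy its label unless done explicitly (`meta1`/`meta2` are illustrative stand-ins for a per-DataFrame metadata store, not a real API):

```julia
# Illustrative only: per-table metadata stores, kept separate from the values.
meta2 = Dict{Symbol,Any}(:othercol => "Date of birth")
meta1 = Dict{Symbol,Any}()

# df1[:newcol] = df2[:othercol] would copy values only; the label must be
# carried over explicitly, mirroring metadata(df)[:newcol] = metadata(df2)[:othercol]
meta1[:newcol] = meta2[:othercol]
```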

@johnmyleswhite
Contributor

We don't need to store a nullable attribute. We just need to expose that information through an interface. But it might be faster to check a BitVector than to check the type tag of each column. Let's worry about implementation later and focus on design first.

I'm not really ready to embrace custom attributes just yet, since it fragments the community if some people's DataFrames have properties that other DataFrames don't share. Let's think about whether we should have them later.

For factors, the levels of the factor will be stored in the type system, not in a DataArray.

@johnmyleswhite
Contributor

I don't think we should store metadata in columns. AFAIK, RDBMS systems don't do that: these properties are attached to the specific table and don't come along for the ride with the values in that table. So I'd rather require the user to do metadata(df)[:newcol] = metadata(df2)[:othercol].

@tshort
Contributor Author

tshort commented Jan 29, 2014

Sounds good, John.

@nalimilan
Member

I'm not very familiar with database management systems, but it seems to me it would be convenient and completely logical to preserve column labels if you copy a column to another DataFrame, which is what df[:newcol] = df2[:othercol] is about. That said, a special function to copy a column could also be added if needed, which would handle this special case.

A more general issue I'm thinking about is that if meta-data is attached to the DataFrame and not the vector, then an (imaginary) call like plot(df[:col1], df[:col2]) will not be able to access the column label to provide a meaningful default axis label. An interface dedicated to DataFrames would have to be used, something like plot(~ col1 + col2, df). This sounds fine to me (and even better than the first form), but it's worth checking that it would work in all common cases.

@johnmyleswhite
Contributor

I think people should always specify their axes labels manually if they don't want defaults.

@nalimilan
Member

Of course, I don't deny that. I was talking about the impossibility for plot(df[:col1], df[:col2]) to offer reasonable defaults when the user does not set them explicitly.

@johnmyleswhite
Contributor

That's true. But we'll never offer anything as smooth as R's kind of defaults, where you have access to information about the calling context. I think people can get used to explicitness.

@nalimilan
Member

With column labels, we can actually offer something much more useful than R's defaults. In R most of the time the default axis label is ugly or useless, e.g. df[["datebrth"]] or even x[[3]]. With DataFrames it could be Date of birth instead.

@johnmyleswhite
Contributor

We can use column labels for interfaces that take in DataFrames as arguments in the way that Gadfly does. For things that work with vectors, they should not assume labels will exist. Otherwise they're broken for normal Arrays.

@nalimilan nalimilan mentioned this issue Jun 9, 2014
Closed
@quinnj
Member

quinnj commented Sep 7, 2017

Do we really want arbitrary metadata in the type? Seems like there are generally other ways to include metadata about your DataFrame w/o stuffing it into the type itself.

@nalimilan
Member

I think something like that would be useful, even if it's not the highest priority. How could you store metadata about columns without support in DataFrame itself?

@pdeffebach
Contributor

I agree with the above. If the goal is to have easy plotting (automatic labels) and easy table creation, forcing a long list of packages to interact with a third-party labeling package would be far more difficult to maintain than incorporating metadata into DataFrames.

@bkamins bkamins mentioned this issue Jan 15, 2019
31 tasks
@bkamins bkamins added the non-breaking The proposed change is not breaking label Feb 12, 2020
@bkamins bkamins added this to the 2.0 milestone Feb 12, 2020
@bkamins bkamins modified the milestones: 1.x, 1.4 Dec 10, 2021
@bkamins bkamins mentioned this issue Dec 11, 2021
@bkamins bkamins added metadata and removed decision non-breaking The proposed change is not breaking labels May 22, 2022
nalimilan pushed a commit that referenced this issue May 26, 2022
closes #34

use TimeZones to support R's POSIXct (some non-IANA timezone codes generated by R not supported)
@bkamins
Member

bkamins commented Sep 20, 2022

Closed with #3055
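For readers landing here later, the API added in #3055 (available from DataFrames.jl 1.4 on) follows the DataAPI.jl metadata interface; roughly, assuming a recent DataFrames.jl:

```julia
using DataFrames

df = DataFrame(gdpg = [2.1, 1.4, 3.0])

# Table-level metadata; style=:note means it propagates through transformations.
metadata!(df, "caption", "Example table"; style=:note)
# Column-level metadata, e.g. a human-readable variable label.
colmetadata!(df, :gdpg, "label", "Annual GDP growth"; style=:note)

metadata(df, "caption")          # "Example table"
colmetadata(df, :gdpg, "label")  # "Annual GDP growth"
```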

@bkamins bkamins closed this as completed Sep 20, 2022