
Metadata in data.tables #4804

Open
raneameya opened this issue Nov 10, 2020 · 14 comments
Labels
top request One of our most-requested issues

Comments

@raneameya

TLDR

I want to avoid column names like Z-normalised temperature (from DB1 before 2018 & from DB2 afterwards), without relying on code comments and in a way that is native to the object (e.g. code comments can't be saved with a DT.rds file on disk). Hence the request: introduce a metadata function that retrieves metadata optionally supplied by the user.

Problem

Pre-modelling, a large amount of time is spent collating and joining data from various sources and transforming columns to make them just right for the model. A column may therefore have been transformed multiple times, with edge cases dealt with on a case-by-case basis.

Current solution

Use descriptive column names or code comments. Column names can easily get unwieldy, e.g. Z-normalised temperature (from DB1 before 2018 & from DB2 afterwards). Comments, on the other hand, can't be shared as part of the data.table.

Proposed solution

Introduce a metadata function that returns information, optionally stored by the user, describing each column. This would probably also involve a setmetadata function or a metadata<- replacement function? Not sure.

@MichaelChirico
Member

Seems related: #623

@raneameya
Author

raneameya commented Nov 11, 2020

Yes, very related. I've thumbs-upped that issue.

Although setmetadata could be extended beyond labels. Another use case I had in mind was the ability to add formulae.

PCPC = PrivCons / Population (PCPC is a newly created column for private consumption per capita, derived from existing columns in the original data.table). Description fields like this could be updated automatically if the column names in the associated data.table were changed.

@jangorecki
Member

@gayyaM Not really following how this request extends the linked one. Could you please provide example code and expected results?

@raneameya
Author

raneameya commented Nov 12, 2020

Sure. Let me know if this is clear. The idea is that setmetadata can be extended beyond just Description & Formula.

library(data.table)
DT <- data.table(x = 1:5)
DT[
  , `:=`(y = 2 * x, z = x * x)
]
md <- list(
  Description = c(x = 'An unknown', y = 'Twice the unknown', z = 'Unknown squared'), 
  Formula     = c(y = '2 * x', z = 'x * x')
)
setmetadata(DT, md)
metadata(DT)
> Description
> x: An unknown
> y: Twice the unknown
> z: Unknown squared
> Formula
> y = 2 * x
> z = x * x
setnames(DT, c('x', 'y', 'z'), c('xx', 'yy', 'zz'))
metadata(DT)
> Description
> xx: An unknown
> yy: Twice the unknown
> zz: Unknown squared
> Formula
> yy = 2 * xx
> zz = xx * xx
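For concreteness, here is a minimal sketch of how such a pair could sit on top of attributes. setmetadata()/metadata() are hypothetical names, not data.table functions, and this sketch does not do the setnames() syncing shown in the expected output above:

```r
library(data.table)

# Hypothetical helpers: store the user-supplied list in a "metadata" attribute.
# setattr() assigns by reference, matching data.table's set* conventions.
setmetadata <- function(DT, md) setattr(DT, "metadata", md)
metadata    <- function(DT) attr(DT, "metadata", exact = TRUE)

DT <- data.table(x = 1:5)
DT[, `:=`(y = 2 * x, z = x * x)]
setmetadata(DT, list(
  Description = c(x = "An unknown", y = "Twice the unknown", z = "Unknown squared"),
  Formula     = c(y = "2 * x", z = "x * x")
))
metadata(DT)$Formula[["z"]]  # "x * x"
```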

@fcocquemas
Contributor

I've been using attr() for this purpose. In your example I could do:

attr(DT, "metadata") <- md
attr(DT, "metadata")

You can even attach the attribute to a column:

attr(DT$x, "metadata") <- list(Description = md$Description['x'])
attr(DT$x, "metadata")

Maybe this helps?

@KyleHaynes
Contributor

I like the idea, @fcocquemas. I've done this in the past, but I often find the attribute gets lost after basic operations.

Reprex

require(data.table)

# Define some metadata.
md = list(x = "some metadata about iris::Species", y = "some more")

# Coerce iris to a data.table.
DT = data.table(iris)

# Assign some metadata at a variable (column) level
attr(DT$Species, "metadata") = list(Description = md$x)
str(DT)
# Classes ‘data.table’ and 'data.frame':  150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#   ..- attr(*, "metadata")=List of 1
#   .. ..$ Description: chr "some metadata about iris::Species"
#  - attr(*, ".internal.selfref")=<externalptr> 

# Basic coercion. 
DT[, Species := as.character(Species)]
str(DT)
# Classes ‘data.table’ and 'data.frame':  150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
#  - attr(*, ".internal.selfref")=<externalptr> 

# Metadata gone :(

@fcocquemas
Contributor

Yes, that's fair: when attached to a column, the attribute will disappear when you transform the data. My use case is mostly long-term storage, so it's not a big issue.

You could probably overload := to save and restore the attributes after an operation. Or keep the metadata attached to the data.table rather than to the column.
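A minimal sketch of the second option, assuming a table-level "metadata" attribute rather than any official mechanism: because the attribute hangs off the table, not the column vector, replacing the column by reference leaves it intact:

```r
library(data.table)

DT <- data.table(iris)
setattr(DT, "metadata", list(Species = "some metadata about iris::Species"))

DT[, Species := as.character(Species)]  # the coercion that dropped the column-level attribute
attr(DT, "metadata")$Species            # survives, because it lives on the table
```

Operations that build a new table (e.g. merge(), DT[i], rbindlist()) can still drop it, though.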

@geneorama

I've used attr and attributes in the past to try to accomplish this, but I've never found a clean way to keep track of an object's attributes.

There are so many times when this would be useful, especially with factor handling. If, for instance, you have a model trained on 50 states, it would be nice to automatically apply the same factor levels to a prediction dataset that happens to be missing a state such as Alabama.
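That factor case can be sketched in base R by reapplying the training levels to the prediction data (state codes made up for illustration):

```r
# The model saw three states; the prediction set happens to be missing "AL".
train_state <- factor(c("AL", "AK", "AZ"))
pred_state  <- factor(c("AK", "AZ"), levels = levels(train_state))

levels(pred_state)         # "AK" "AL" "AZ" -- same columns in the model matrix
table(pred_state)[["AL"]]  # 0: the absent level is kept rather than dropped
```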

@mbacou

mbacou commented Jul 23, 2021

Another approach is to use an ancillary data.table as a "codebook"; optionally, that codebook can itself be stored as an attribute of your main data.table. It's especially useful when you need to label graphs with human-readable labels instead of variable codes. Over the long run I've found it somewhat easier to keep track of variable definitions, units, and imputations in an entirely separate object. Unfortunately, there has never been a widely used standard for annotating statistical datasets.

My own code often includes things like:

library(data.table)

dt <- data.table(x = 1:5)
dt[, `:=`(y = 2 * x, z = x * x)]

dt.meta <- fread("
  code, label, unit, type, description
  x, weight, kg, numeric, long description
  y, double weight, kg, numeric, long description with formula
  z, squared weight, kg, numeric, long description with formula
")
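Keying such a codebook by code then makes human-readable plot labels a one-line lookup (label/unit values assumed, as above):

```r
library(data.table)

dt.meta <- fread("
code, label, unit
x, weight, kg
y, double weight, kg
z, squared weight, kg
")
setkey(dt.meta, code)

# Build an axis label for variable "y" straight from the codebook.
paste0(dt.meta["y", label], " (", dt.meta["y", unit], ")")  # "double weight (kg)"
```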

Another way is to use the Hmisc labelled class (https://rdrr.io/cran/Hmisc/man/label.html) or sjlabelled (https://cran.r-project.org/web/packages/sjlabelled/vignettes/labelleddata.html).

@myoung3
Contributor

myoung3 commented Jul 24, 2021

Might also want to take a look at how labels/formats/etc. are implemented in the haven package. They're implemented as vectors with attributes, plus some helper functions. These objects seem to behave reasonably well as columns in a data.table.

@raneameya
Author

That's an interesting approach, @mbacou. I've used the dt.meta approach once in the past. The core feature I found missing was automatic linking of column names to the metadata: if the column names were updated, dt.meta also needed updating by hand, which sort of defeated the purpose.
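One hypothetical workaround for that missing link is a small wrapper (setnames_sync is made up here, not a data.table function) that renames the columns and the codebook rows together:

```r
library(data.table)

setnames_sync <- function(DT, meta, old, new) {
  setnames(DT, old, new)
  meta[match(old, code), code := new]  # keep the codebook's codes in sync
  invisible(DT)
}

dt      <- data.table(x = 1:3, y = 4:6)
dt.meta <- data.table(code = c("x", "y"), label = c("weight", "double weight"))

setnames_sync(dt, dt.meta, "x", "xx")
names(dt)     # "xx" "y"
dt.meta$code  # "xx" "y"
```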

@tonyfischetti
Contributor

To add my two cents, I've been using the attr method heavily in a package of mine to keep track of things like when the data was last updated, together with a wrapper around fread/fwrite that reads/writes the date directly from the file's basename.
It's a little complicated because I have to remember to copy the attributes after certain transformations/re-assignments.
Not sure what a robust solution would look like, but it sounds like a potentially really useful idea.

@cthombor

cthombor commented Nov 25, 2022

I'm now using R and data.table for the first time. My current workaround for storing metadata about an experiment is to define a method that preserves the attributes on a data.table when I add a new set of experimental observations:

#' add a row to a SafeRankExpt object
#'
#' @param object prior results of experimentation
#' @param row    new observations
#'
#' @return updated SafeRankExpt object
rbind.SafeRankExpt <- function(object, row) {
  stopifnot(is.SafeRankExpt(object))
  ao <- attributes(object)
  object <- rbind(object, row, use.names = TRUE)
  # Restore only the custom attributes: names, row.names and data.table's
  # internal self-reference must come from the new, longer object, or
  # nrow() and by-reference operations can misbehave.
  for (a in setdiff(names(ao), c("names", "row.names", ".internal.selfref")))
    attr(object, a) <- ao[[a]]
  stopifnot(is.SafeRankExpt(object))
  return(object)
}

BTW I'd happily take suggestions on how to improve my coding style in R. My only prior experience with statistical experimentation was in the early 1990s, using S, to experimentally validate my PRNG package mrandom!

@cthombor

AFAIK most of the design energy around data.table goes into supporting data analysis (especially of very large datasets), not data collection in an experimental setting (even where large amounts of experimental data are being collected). I can sort of understand why the attributes of a data.table object are not reliably preserved across operations that add rows, and contributors in this thread seem unsurprised that attributes of the table, or of its columns, are not reliably preserved. But to avoid newbies like me picking up the "wrong" package for their task, perhaps the loss of attributes on the returned or modified object could be more clearly disclosed in the documentation of functions such as rbindlist, which (in my very limited experience with a single release) seems to reliably return a data.table without any object-level attributes aside from its class.

And... I believe I do understand why a data analyst would want metadata on their data.table objects describing their provenance -- and that most (but not all) of the relevant provenance metadata would record the provenance of individual columns rather than of the data.table as a whole. Digital provenance is a deep subject, and I can well understand why it'd be a sinkhole to impose additional structure on column-level metadata... but (as some have noted in this thread) it's annoying to develop a bespoke structure for encoding a column's provenance in its attributes, only to discover -- late in the game, during debugging rather than during code design -- that a column's attributes in a data.table are not a reliable place to store provenance information.

I'll close this with an explanation of why I think the top-level documentation for data.table should warn experimentalists away from using this package to store their experimental data. They should instead be steered toward a matrix, at least until data.table reliably preserves object-level attributes across all operations, as is (AFAIK) the case with data.frame and matrix. Uncharitably, my current understanding is that data.table has some poorly documented delete-on-modify semantics with respect to all attributes other than class. And a glance through its issue tracker suggests that preserving class attributes on columns is an ongoing difficulty for the project team, which doesn't surprise me in the least, because the base semantics of R are very complex with respect to when and how class coercions "should" occur.

Thanks for reading through this long explanation of my newbie difficulties with data.table! My impression is that it'll be a great package for my future data analysis, even though it was (in hindsight) a very poor choice for data collection in stochastic (pseudo-random) experimentation.

@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024