
Metadata in data.tables #4804

Open
raneameya opened this issue Nov 10, 2020 · 14 comments
Labels
top request One of our most-requested issues

Comments

@raneameya

TLDR

I want to avoid column names like Z-normalised temperature (from DB1 before 2018 & from DB2 afterwards), without relying on code comments and in a way that is native to the object (e.g. code comments can't be saved with a DT.rds file on disk). Hence the request: introduce a metadata function that retrieves metadata optionally supplied by the user.

Problem

Pre-modelling, a large amount of time is spent collating and joining data from various sources and transforming columns to make them just right for the model. A column may therefore have been transformed multiple times, with edge cases dealt with on a case-by-case basis.

Current solution

Use descriptive column names or code comments. Column names can easily get unwieldy, e.g. Z-normalised temperature (from DB1 before 2018 & from DB2 afterwards). Comments, on the other hand, can't be shared as part of the data.table.

Proposed solution

Introduce a metadata function that returns information, optionally stored by the user, describing each column. This would probably also involve a setmetadata function or a metadata<- replacement function? Not sure.

@MichaelChirico
Member

Seems related: #623

@raneameya
Author

raneameya commented Nov 11, 2020

Yes, very related. I've thumbs-upped that issue.

Although setmetadata could be extended beyond labels. Another use case I had in mind was the ability to add formulae.

PCPC = PrivCons / Population (PCPC is a newly created column for private consumption per capita, derived from existing columns in the original data.table). Description fields like this could be updated automatically if the column names in the associated data.table were changed.

@jangorecki
Member

@gayyaM Not really following how this request extends the linked one. Could you please provide example code and expected results?

@raneameya
Author

raneameya commented Nov 12, 2020

Sure. Let me know if this is clear. The idea is that setmetadata can be extended beyond just Description & Formula.

library(data.table)
DT <- data.table(x = 1:5)
DT[
  , `:=`(y = 2 * x, z = x * x)
]
md <- list(
  Description = c(x = 'An unknown', y = 'Twice the unknown', z = 'Unknown squared'), 
  Formula     = c(y = '2 * x', z = 'x * x')
)
setmetadata(DT, md)
metadata(DT)
> Description
> x: An unknown
> y: Twice the unknown
> z: Unknown squared
> Formula
> y = 2 * x
> z = x * x
setnames(DT, c('x', 'y', 'z'), c('xx', 'yy', 'zz'))
metadata(DT)
> Description
> xx: An unknown
> yy: Twice the unknown
> zz: Unknown squared
> Formula
> yy = 2 * xx
> zz = xx * xx
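For concreteness, here is a minimal sketch of how such a pair could sit on top of attributes. setmetadata()/metadata() are hypothetical names, not data.table functions, and this sketch does not do the setnames() syncing shown in the expected output above:

```r
library(data.table)

# Hypothetical helpers: store the user-supplied list in a "metadata" attribute.
# setattr() assigns by reference, matching data.table's set* conventions.
setmetadata <- function(DT, md) setattr(DT, "metadata", md)
metadata    <- function(DT) attr(DT, "metadata", exact = TRUE)

DT <- data.table(x = 1:5)
DT[, `:=`(y = 2 * x, z = x * x)]
setmetadata(DT, list(
  Description = c(x = "An unknown", y = "Twice the unknown", z = "Unknown squared"),
  Formula     = c(y = "2 * x", z = "x * x")
))
metadata(DT)$Formula[["z"]]  # "x * x"
```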

@fcocquemas
Contributor

I've been using attr() for this purpose. In your example I could do:

attr(DT, "metadata") <- md
attr(DT, "metadata")

You can even attach the attribute to a column:

attr(DT$x, "metadata") <- list(Description = md$Description['x'])
attr(DT$x, "metadata")

Maybe this helps?

@KyleHaynes
Contributor

I like the idea, @fcocquemas. I've done this in the past, but I often find the attribute gets lost after basic operations.

Reprex

require(data.table)

# Define some metadata.
md = list(x = "some metadata about iris::Species", y = "some more")

# Coerce iris to a data.table.
DT = data.table(iris)

# Assign some metadata at a variable (column) level
attr(DT$Species, "metadata") = list(Description = md$x)
str(DT)
# Classes ‘data.table’ and 'data.frame':  150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
#   ..- attr(*, "metadata")=List of 1
#   .. ..$ Description: chr "some metadata about iris::Species"
#  - attr(*, ".internal.selfref")=<externalptr> 

# Basic coercion. 
DT[, Species := as.character(Species)]
str(DT)
# Classes ‘data.table’ and 'data.frame':  150 obs. of  5 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
#  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
#  - attr(*, ".internal.selfref")=<externalptr> 

# Metadata gone :(

@fcocquemas
Contributor

Yes, that's fair: when attached to a column, the attribute will disappear when you transform the data. My use case is mostly long-term storage, so it's not a big issue.

You could probably overload := to save and restore the attributes after an operation. Or keep the metadata attached to the data.table rather than to the column.
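A minimal sketch of the second option, assuming a table-level "metadata" attribute rather than any official mechanism: because the attribute hangs off the table, not the column vector, replacing the column by reference leaves it intact:

```r
library(data.table)

DT <- data.table(iris)
setattr(DT, "metadata", list(Species = "some metadata about iris::Species"))

DT[, Species := as.character(Species)]  # the coercion that dropped the column-level attribute
attr(DT, "metadata")$Species            # survives, because it lives on the table
```

Operations that build a new table (e.g. merge(), DT[i], rbindlist()) can still drop it, though.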

@geneorama

I've used attr and attributes in the past to try to accomplish this, but I've never found a clean way to keep track of an object's attributes.

There are so many times when this would be useful, especially with factor handling. If, for instance, you have a model trained on 50 states, it would be nice to automatically apply the same factor levels to a prediction dataset that happens to be missing a state such as Alabama.
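That factor case can be sketched in base R by reapplying the training levels to the prediction data (state codes made up for illustration):

```r
# The model saw three states; the prediction set happens to be missing "AL".
train_state <- factor(c("AL", "AK", "AZ"))
pred_state  <- factor(c("AK", "AZ"), levels = levels(train_state))

levels(pred_state)         # "AK" "AL" "AZ" -- same columns in the model matrix
table(pred_state)[["AL"]]  # 0: the absent level is kept rather than dropped
```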

@mbacou

mbacou commented Jul 23, 2021

Another approach is to use an ancillary data.table as a "codebook"; optionally, that codebook can itself be stored as an attribute of your main data.table. It's especially useful when you need to label graphs with human-readable labels instead of variable codes. Over the long run I've found it somewhat easier to keep track of variable definitions, units, and imputations in an entirely separate object. Unfortunately, there has never been a widely used standard for annotating statistical datasets.

My own code often includes things like:

library(data.table)

dt <- data.table(x = 1:5)
dt[, `:=`(y = 2 * x, z = x * x)]

dt.meta <- fread("
  code, label, unit, type, description
  x, weight, kg, numeric, long description
  y, double weight, kg, numeric, long description with formula
  z, squared weight, kg, numeric, long description with formula
")
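Keying such a codebook by code then makes human-readable plot labels a one-line lookup (label/unit values assumed, as above):

```r
library(data.table)

dt.meta <- fread("
code, label, unit
x, weight, kg
y, double weight, kg
z, squared weight, kg
")
setkey(dt.meta, code)

# Build an axis label for variable "y" straight from the codebook.
paste0(dt.meta["y", label], " (", dt.meta["y", unit], ")")  # "double weight (kg)"
```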

Another way is to use the Hmisc labelled class (https://rdrr.io/cran/Hmisc/man/label.html) or sjlabelled (https://cran.r-project.org/web/packages/sjlabelled/vignettes/labelleddata.html).

@myoung3
Contributor

myoung3 commented Jul 24, 2021

Might also want to take a look at how labels/formats/etc. are implemented in the haven package. They're implemented as vectors with attributes, plus some helper functions. These objects seem to behave reasonably well as columns in a data.table.

@raneameya
Author

That's an interesting approach, @mbacou. I've used the dt.meta approach once in the past. The core feature I found missing was automatic linking of column names to the metadata: if the column names were updated, dt.meta also needed updating by hand, which sort of defeated the purpose.
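One hypothetical workaround for that missing link is a small wrapper (setnames_sync is made up here, not a data.table function) that renames the columns and the codebook rows together:

```r
library(data.table)

setnames_sync <- function(DT, meta, old, new) {
  setnames(DT, old, new)
  meta[match(old, code), code := new]  # keep the codebook's codes in sync
  invisible(DT)
}

dt      <- data.table(x = 1:3, y = 4:6)
dt.meta <- data.table(code = c("x", "y"), label = c("weight", "double weight"))

setnames_sync(dt, dt.meta, "x", "xx")
names(dt)     # "xx" "y"
dt.meta$code  # "xx" "y"
```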

@tonyfischetti
Contributor

To add my two cents, I've been using the attr method heavily in a package of mine to keep track of things like when the data was last updated, together with a wrapper around fread/fwrite that reads/writes the date directly from the file's basename.
It's a little complicated because I have to remember to copy the attributes after certain transformations/re-assignments.
Not sure what a robust solution would look like, but it sounds like a potentially really useful idea.

@cthombor

cthombor commented Nov 25, 2022

I'm now using R and data.table for the first time. My current workaround for storing metadata about an experiment is to define a method that preserves the attributes on a data.table when I add a new set of experimental observations:

#' add a row to a SafeRankExpt object
#'
#' @param object prior results of experimentation
#' @param row    new observations
#'
#' @return updated SafeRankExpt object
rbind.SafeRankExpt <- function(object, row) {
  stopifnot(is.SafeRankExpt(object))
  ao <- attributes(object)
  object <- rbind(object, row, use.names = TRUE)
  # Restore only the custom attributes: names, row.names and data.table's
  # internal self-reference must come from the new, longer object, or
  # nrow() and by-reference operations can misbehave.
  for (a in setdiff(names(ao), c("names", "row.names", ".internal.selfref")))
    attr(object, a) <- ao[[a]]
  stopifnot(is.SafeRankExpt(object))
  return(object)
}

BTW I'd happily take suggestions on how to improve my coding style in R. My only prior experience with statistical experimentation was in the early 1990s, using S, to experimentally validate my PRNG package mrandom!

@cthombor

AFAIK most of the design energy around data.table goes into supporting data analysis (especially of very large datasets), not data collection in an experimental setting (even where large amounts of experimental data are being collected). I can sort of understand why the attributes of a data.table object are not reliably preserved across operations that add rows, and contributors in this thread seem unsurprised that attributes of the table, or of its columns, are not reliably preserved. But to avoid newbies like me picking up the "wrong" package for their task, perhaps the loss of attributes on the returned or modified object could be more clearly disclosed in the documentation of functions such as rbindlist, which (in my very limited experience with a single release) seems to reliably return a data.table without any object-level attributes aside from its class.

And... I believe I do understand why a data analyst would want metadata on their data.table objects describing their provenance -- and that most (but not all) of the relevant provenance metadata would record the provenance of individual columns rather than of the data.table as a whole. Digital provenance is a deep subject, and I can well understand why it'd be a sinkhole to impose additional structure on column-level metadata... but (as some have noted in this thread) it's annoying to develop a bespoke structure for encoding a column's provenance in its attributes, only to discover -- late in the game, during debugging rather than during code design -- that a column's attributes in a data.table are not a reliable place to store provenance information.

I'll close this with an explanation of why I think the top-level documentation for data.table should warn experimentalists away from using this package to store their experimental data. They should instead be steered toward a matrix, at least until data.table reliably preserves object-level attributes across all operations, as is (AFAIK) the case with data.frame and matrix. Uncharitably, my current understanding is that data.table has some poorly documented delete-on-modify semantics with respect to all attributes other than class. And a glance through its issue tracker suggests that preserving class attributes on columns is an ongoing difficulty for the project team, which doesn't surprise me in the least, because the base semantics of R are very complex with respect to when and how class coercions "should" occur.

Thanks for reading through this long explanation of my newbie difficulties with data.table! My impression is that it'll be a great package for my future data analysis, even though it was (in hindsight) a very poor choice for data collection in stochastic (pseudo-random) experimentation.

@MichaelChirico MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024