Provenance tracking using semantic web technologies? #228

Open · DamienIrving opened this issue May 24, 2021 · 22 comments

@DamienIrving

DamienIrving commented May 24, 2021

In recent years, Semantic Web technologies have been used to record the data processing steps involved in producing climate products (i.e. maps, plots or any other climate research outcome stored in a file).

While Python packages such as rook and ESMValTool simply define their own bespoke, narrow adaptations of the PROV / RDF data model to suit their own needs, there have been attempts to define a comprehensive ontology for climate products (e.g. Bedia et al., 2019; Zhang et al., 2020).

As far as I can tell, the most widely used ontology is METACLIP (METAdata for CLImate Products; see their website, flyer and paper), which was initially developed for the Copernicus QA4Seas seasonal forecasting project and is now also being used for the VALUE downscaling initiative. The METACLIP developers work in R and have integrated their approach to provenance tracking into the climate4R package.

I'm wondering if there's any interest in trying to figure out how to incorporate provenance tracking into xarray? There isn't a Python implementation of METACLIP yet, but presumably one could implement the ontology using rdflib. @huard suggested cf-xarray might be a good place to start the conversation around this, but happy to move the discussion elsewhere if it would be more appropriate?
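For concreteness, here is a minimal rdflib sketch of the kind of graph such an implementation would build. It uses the generic W3C PROV-O vocabulary with made-up node names in a placeholder namespace, not the actual METACLIP ontology:

from rdflib import Graph, Literal, Namespace, RDF, RDFS

PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/")  # placeholder namespace, for illustration only

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

# a source dataset, a processing step, and the derived climate product
g.add((EX.tas_source, RDF.type, PROV.Entity))
g.add((EX.annual_mean, RDF.type, PROV.Activity))
g.add((EX.tas_annual, RDF.type, PROV.Entity))

g.add((EX.annual_mean, PROV.used, EX.tas_source))
g.add((EX.tas_annual, PROV.wasGeneratedBy, EX.annual_mean))
g.add((EX.tas_annual, PROV.wasDerivedFrom, EX.tas_source))
g.add((EX.annual_mean, RDFS.label, Literal("ds.groupby('time.year').mean()")))

print(g.serialize(format="turtle"))  # returns a str in rdflib >= 6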

@DamienIrving changed the title from "Provenance tracking using sematic web technologies?" to "Provenance tracking using semantic web technologies?" on May 24, 2021
@dcherian
Contributor

@DamienIrving thanks for opening this thoughtful issue.

I'm wondering if there's any interest in trying to figure out how to incorporate provenance tracking into xarray?

Yes absolutely.

but happy to move the discussion elsewhere if it would be more appropriate?

I think this should be discussed in a new issue on the xarray github. It will get a lot more visibility and thoughtful input. Assuming that provenance could be tracked through a fancy attr, it will also tie in to the discussions at pydata/xarray#988, pydata/xarray#4896, and pydata/xarray#3891 (comment)

I think users would want to opt-in to provenance tracking at the xarray level rather than only when calling functions through the .cf namespace. But cf-xarray or some other package could provide a function that handles incoming provenance attributes and returns an appropriately merged set of attributes.
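To make that last point concrete, a sketch of what such a helper might look like (the name and the 'provenance' attribute key are hypothetical, not an existing cf-xarray API):

def merge_provenance_attrs(all_attrs):
    """Merge the attrs of every input to an operation into one dict,
    accumulating any 'provenance' records instead of dropping them."""
    merged = {}
    for attrs in all_attrs:
        merged.update(attrs)
    records = [attrs["provenance"] for attrs in all_attrs if "provenance" in attrs]
    if records:
        merged["provenance"] = records
    return merged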

@huard

huard commented May 25, 2021

@dcherian I agree there is value in getting input from the xarray crowd, and I agree provenance should be supported at the xarray level. However, I'm concerned that finding an all-purpose solution that works for every discipline will end up in an infinite github comment thread. I'd like to first define what the CF community wants, and then carry this proposal to xarray for feedback.

Based on discussions at the 2020 CF meeting and private discussions with provenance pros, here are some development guidelines:

@dcherian
Contributor

I'm concerned that finding an all-purpose solution that works for every discipline will end up in an infinite github comment thread.

Fair point, but I think it would still be valuable to figure out how to do this technically. Xarray is unlikely to provide provenance handling directly, so we'll have to figure out how to hook into it. The biweekly xarray dev meeting is tomorrow (Wednesday @ 9.30am Mountain Time). @huard & @DamienIrving, if you can, I think it would be good for you to attend and raise this issue there. If not, the meeting repeats every two weeks... We can use it to drive the conversation forward.

I'd like to first define what the CF community wants,

👍 but the audience on this repo is really tiny! Maybe https://discourse.pangeo.io/?

@huard

huard commented May 25, 2021

Makes sense, will try those suggestions. Thanks!

@huard

huard commented May 25, 2021

@DamienIrving
Author

Thanks, @dcherian. The xarray dev meeting is 1:30am Australian Eastern Standard Time so I won't be able to make it, but if you're available, @huard, that sounds like a good idea?

@huard

huard commented May 26, 2021

Hi @DamienIrving
I attended the meeting and presented the issue. There's an open PR that could possibly be used as a building block to track provenance: pydata/xarray#4896

@dcherian
Contributor

So... I'm thinking that cf_xarray can provide two functions, track_provenance and track_history, that users could pass to that combine_attrs setting.
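A rough sketch of the track_history half, assuming the callable combine_attrs interface from pydata/xarray#4896 (the signature shown here is an assumption, and a track_provenance version would build a graph instead of a string):

import datetime

def track_history(attrs_list, context=None):
    """Combine attrs from all inputs and append a timestamped history entry."""
    merged = {}
    for attrs in attrs_list:
        merged.update(attrs)
    stamp = datetime.datetime.utcnow().isoformat()
    entry = f"{stamp}: combined {len(attrs_list)} objects"
    merged["history"] = (merged.get("history", "") + "\n" + entry).strip()
    return merged

# hypothetical usage once callable combine_attrs is released:
# ds = xr.merge([ds1, ds2], combine_attrs=track_history)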

@jbusecke

Just stumbled upon this on discourse, and I was wondering if xgcm (xgcm/xgcm#143) could piggyback on these efforts?

@dcherian
Contributor

Ah, I guess another angle is how projects like xgcm can plug into the provenance tracking infrastructure. xgcm could easily set the history attribute in each method today.
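For example (a rough sketch, with a made-up helper name, not an existing xgcm function):

import datetime

def append_history(da, message):
    """Return a copy of `da` with `message` appended to its 'history' attribute."""
    out = da.copy()
    stamp = datetime.datetime.utcnow().isoformat()
    previous = out.attrs.get("history", "")
    out.attrs["history"] = (previous + "\n" if previous else "") + f"{stamp}: {message}"
    return out

# e.g. at the end of a hypothetical xgcm method:
# result = append_history(result, "xgcm.Grid.diff(da, axis='X')")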

@huard

huard commented May 26, 2021

@dcherian I like this idea. track_history would be the human-readable version, directly embedded in the object's attributes, and track_provenance the machine-readable version stored in an external resource (file or db).

@jbusecke I think the spirit is to have xarray expose a hook, and then let libraries configure those hooks to do the actual provenance tracking. I see no reason why xgcm couldn't implement its own provenance mechanism, built on or in parallel with cf-xarray using the same hook.

@jbusecke

This sounds great. Please keep me posted if there is anything I can help with or test on the xgcm side.

@dcherian
Contributor

dcherian commented Aug 3, 2021

See #253. Opinions are welcome!

@cjauvin

cjauvin commented Aug 17, 2021

Hi all, I work with @huard and last week he asked me to take a stab at this problem.

I read the METACLIP paper and also carefully studied @dcherian's PR. Assuming that the mechanism (and final outcome) the paper describes is mostly in line with what is discussed in this thread, I'm not sure I see (yet) how something similar could be achieved using the proposed extension.

My understanding of the METACLIP system is that its crucial feature is a specialized module (metaclipR) that is used to explicitly build the underlying RDF graph of the metadata in an incremental way. It is explicit in the sense that the metadata-defining steps are interspersed between the "real" logic calls (the ones computing the data product whose provenance we want to track).

So I'm wondering: could we solve this problem without such an extra component (i.e. something akin to metaclipR)? Could we do it by using an attribute-based and/or introspection mechanism only, along the lines of what @dcherian proposes?

@huard

huard commented Aug 17, 2021

Played with the PR, works well. A couple of thoughts.

  • The context object is fairly thin at the moment. If we want to print the function call signature, I believe we'll need
    • the function object (not just the name) so we can inspect it;
    • self, so we can extract metadata from the DataArray being operated on;
    • other function arguments.
  • How do we store the provenance document? (see the sketch of the first option below)
    • Could be a prov.ProvDocument object stored in attrs['has_provenance']. When to_netcdf is called, this could be serialized and saved in an external file.
    • Could be a path saved in attrs['has_provenance'] pointing to a serialized representation of the provenance graph. Each time the provenance tracker is called, it opens the file, adds nodes, serializes it and saves it back to disk.
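A minimal sketch of the first option, using the prov package; the namespace and node names are made up, and error handling is omitted:

import xarray as xr
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")
source = doc.entity("ex:tas_source")
activity = doc.activity("ex:annual_mean")
product = doc.entity("ex:tas_annual")
doc.used(activity, source)
doc.wasGeneratedBy(product, activity)

da = xr.DataArray([1.0, 2.0], dims="time", attrs={"has_provenance": doc})

# at write time, serialize the graph to an external file and keep only a
# reference in the attributes (netCDF attrs must be plain types)
with open("prov.json", "w") as f:
    f.write(da.attrs["has_provenance"].serialize())
da.attrs["has_provenance"] = "prov.json"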

@dcherian
Contributor

Hello @cjauvin, excited to see you try this out.

the metadata defining steps are interspersed between the "real" logic calls

Basically xarray won't explicitly add these metadata tracking steps but provides a hook where a custom function could do that.

could we solve this problem without such an extra component (i.e. something akin to metaclipR)

I don't think so? It seems like there's a need for pymetaclip or something like that to actually do the hard work.

self, so we can extract metadata from the DataArray being operated on;

So you want self and not just self.attrs? What metadata are you extracting? I think the rest is planned, just not implemented yet.

Each time the provenance tracker is called, it opens the file, adds nodes, serializes it and saves it back to disk.

Is this really needed at every function call or only when a result is written to disk?

@huard

huard commented Aug 17, 2021

What metadata are you extracting?

I believe we will need global dataset attributes (source, institution, contact, pid, etc), not just DataArray attributes.

Is this really needed at every function call or only when a result is written to disk?

Probably not. But then attrs['has_provenance'] would be an in-memory provenance graph object, and we'd need to delegate to the to_netcdf function the responsibility of serializing this to a file, or manually do something like:

prov_fn = "prov.json"
serialize(da.attrs['has_provenance'], prov_fn)  # write the in-memory graph to an external file
da.attrs['has_provenance'] = prov_fn            # replace the object with a reference to that file

@pagecp

pagecp commented Aug 18, 2021

I agree very much with what you suggested in this thread, @huard: about global attributes, about the two-level information type (human-readable and machine-oriented with more details), and also about the suggestion for the has_provenance attribute, ...

@dcherian
Contributor

I believe we will need global dataset attributes (source, institution, contact, pid, etc), not just DataArray attributes.

Hmmm... this is hard to fit in with xarray's design. But do you just need these when you "initialize" the provenance object?

ds = xr.open_dataset(...)

# this loops through and creates a new provenance object for each DataArray, 
# using both DataArray and Dataset attributes
ds = init_provenance(ds)

If we can do that, is it correct that ds["array"].mean() can update the provenance object appropriately on ds.array without needing access to ds.attrs?
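To sketch what I mean (init_provenance is hypothetical, and the record here is just a dict rather than a real provenance graph):

import xarray as xr

def init_provenance(ds: xr.Dataset) -> xr.Dataset:
    """Attach a provenance record, seeded with the Dataset-level attributes,
    to every DataArray so later per-array operations don't need ds.attrs."""
    ds = ds.copy()
    for name, da in ds.data_vars.items():
        da.attrs["has_provenance"] = {
            "variable": name,
            "dataset_attrs": dict(ds.attrs),    # source, institution, contact, ...
            "variable_attrs": dict(da.attrs),
        }
    return ds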

@huard

huard commented Aug 18, 2021

But do you just need these when you "initialize" the provenance object?

We'd need these anytime a Dataset/DataArray is included in the provenance graph.

If we can do that, is it correct that ds["array"].mean() can update the provenance object appropriately on ds.array without needing access to ds.attrs?

Because you would attach the provenance object to each DataArray with the init_provenance call?

@dcherian
Contributor

dcherian commented Sep 1, 2021

Because you would attach the provenance object to each DataArray with the init_provenance call ?

Yes that's what I was thinking but I have no experience in this area.

It would be interesting to see a simple working example that shows the steps required to record all necessary provenance information.

@huard

huard commented Sep 3, 2021

It would be interesting to see a simple working example that shows the steps required to record all necessary provenance information.

After considerable struggle, here is what I could come up with: #259
It does not record all necessary provenance information, just very basic stuff as an example.
