xarray-style "attrs", global and per-record field #1391

jpivarski · 2022-03-31T22:44:29Z

Description of new feature

@philippemiron is converting data from NetCDF4 files into Awkward Arrays, and one of the features we lack is a place to put attributes. These can be descriptions, units, meanings of flags, etc., and a single field can have more than one of these (e.g. so that "units" are programmatically accessible). In general, any JSON-encodable data.

This applies at two levels: globally for a whole array, in such a way that all derived arrays pass on the attributes, and per-record field. The per-record field attributes should only be passed on if the meaning of the field is not changed, and should not be counted as part of the Content node's type. Also, these should be read and written to files as metadata wherever possible.

Why not use parameter __doc__?

This would work as a per-record field attribute that is passed on whenever the meaning of the field is not changed, and doesn't count as part of the Content node's type. However, it has to be a string, since this is what goes into the Python __doc__ property (and therefore IPython and Jupyter help). The attributes have to be general JSON-encodable metadata.

Also, this only encodes per-record field attributes, not global attributes.

Why not use a new parameter?

It would only be accessible through idioms like array.layout.parameters["attrs"]. This is high-level data analyst information, and they shouldn't have to go through layout, which is mid-level, for library developers. Also, such an idiom would fail if the layout ever gets buried in another Content node (i.e. the user has to be aware of layouts and how they change, which is not a high-level view!). For example, rearranging records in a RecordArray nests the RecordArray within an IndexedArray for performance reasons; so after certain kinds of slices, array.layout.parameters["attrs"] becomes array.layout.content.parameters["attrs"], probably unexpectedly.

Why not use behavior?

It's an odd thing to do, but behavior is passed from an ak.Array to any new ak.Array derived from it. Any keys in behavior that aren't recognized are ignored. However, this would only encode global attributes, not per-record field attributes, and behavior (which typically contains function objects and class objects) is not serialized when writing files, or filled when reading files.

What instead?

This is really two new features:

Per-record field metadata, as a parameter, that follows all the same rules as __doc__, but doesn't have to be a string and is exposed at high-level in a different way (one that allows non-strings). Parameters are already serializable.
Global metadata that follows all the same rules as behavior, but is serializable.

They should have names like attrs, following xarray's convention, but the per-record field attrs should be distinguishable somehow from the global attrs.

Perhaps both of these should be dicts (JSON objects) and when we're looking at one field, the two dicts are overlaid, with the per-field keys taking precedence over the global keys? That sounds natural and would minimize the use of names, but it sounds like just the sort of thing that would break something in the future.

What does xarray do?

>>> import xarray as xr
>>> dataarray = xr.DataArray([1, 2, 3], attrs={"wow": "look!"})
>>> dataset = xr.Dataset({"x": dataarray}, attrs={"wow": "wee!", "another": "thing"})
>>> dataset.attrs
{'wow': 'wee!', 'another': 'thing'}
>>> dataset.x.attrs
{'wow': 'look!'}

xarray never has this issue because DataArray and Dataset are different types and can never be confused. In Awkward, an ak.Array is an ak.Array is an ak.Array, regardless of whether it's an array of lists of records, an array of just the records, or an array of one of the fields of those records. The global attrs can propagate down when extracting the array of records from the array of lists of records, but when you get to the array of one of the fields of those records, there's a conflict. They should probably be named differently, since "per-field attributes" and "global attributes" are different concepts, but which one gets the exalted name of "attrs"? And what would the other one be called?

The text was updated successfully, but these errors were encountered:

agoose77 · 2022-11-12T14:09:14Z

I was thinking about this, and the per-field attributes could be implemented using parameters; we just need to provide the appropriate high-level API for this. I imagine this in the same way that the string abstraction is a specialisation of the behaviour system, we could do the same for attrs. However, a drawback of this method is that we don't retain the benefits of field-association, i.e. with_field(array, None, 'field') wouldn't update the attr metadata unless we implemented logic for this.

Saying this, in general, I prefer the idea of a dedicated per-field metadata object rather than piggybacking an existing mechanism. If we redesigned awkward from scratch, I'd be tempted to do the same for the __array__ behavior mechanism such that parameters only contains user-metadata.

Maybe the name of this mechanism is field_parameters? I'll give it some more thought...

agoose77 · 2022-11-15T17:35:29Z

Jim and I discussed this - we'll move this to another release in favour of more pressing issues. This will some require some thought.

agoose77 · 2023-02-01T15:57:10Z

We spoke about this today, and in-general, parameters added to a RecordArray should survive most operations, unless we wouldn't want them to anyway. So, we shouldn't need a new place for Xarray metadata here, it can sit under an __xarray__ name. Meanwhile, although the ak.behavior dict is very close to what we want for a metadata object, it's not a perfect fit:

metadata should be JSON serialisable
metadata should be stored in parquet, etc.

Although we can still use behavior for this, it might be clearer to explicitly add a new metadata mechanism that enforces all of these rules instead of piggy-backing onto the existing behaviour mechanism.

I believe this would motivate replacing ak._util.wrap and ak.to_layout usage internally with a forward/backwards deconstruction mechanism, e.g. ak.compose(...) and ak.decompose(...) that pulls out the (layout, metadata, behavior).

Finally, we likely also want a mechanism for resolving parameters within a dimension.

This metadata should be per-array, explicitly.

jpivarski · 2023-06-02T15:12:57Z

As discussed in CoffeaTeam/coffea#824, there's a use-case for wanting a local storage on ak.Array (in the meta of a dak.Array) that

is propagated to derived arrays, just as behavior is
is appropriate for non-behaviors (Coffea wants to store whole arrays there)
is not serialized in any way: pickle, Parquet, etc.

That use-case shares desirata (1) and (2) with the attrs requested here. The attrs requested here need to be more persistent than behavior, to be saved not only in pickle but also file formats like Parquet (i.e. be included in the AwkwardArrowType.metadata), so the two use-cases strongly disagree on (3).

So how about adding both attrs and transient_attrs to ak.Array. It would be propagated to all derived arrays like behavior, merging like the behavior dict when an operation involves more than one array, but separate from the behavior machinery itself. Where they diverge is

attrs must be JSON-like and will be serialized with the array in all forms
transient_attrs can be any Python objects and will never be serialized.

dask-awkward should someday implement attrs as well, so that the JSON-like metadata can be used on Dask workers, but dask-awkward will never need to implement the transient_attrs, since you never know if a Dask object has been serialized and sent to another worker or not. (Maybe there would be value in dak.Array.transient_attrs for caching node-level information, but that can be reconsidered.) Even with that limitation, it still satisfies the Coffea use-case because Coffea only needs this at type-tracer/Dask graph-building time, and dak.Array.meta.transient_attrs would then be available.

Cc: @lgray, @nsmith-, @martindurant, @douglasdavis, @agoose77

agoose77 · 2023-06-11T14:18:34Z

I was starting to work on this, and concluded that the merging rules for attrs are at first-glance more like parameters than behavior. I've been thinking about attrs as array-associated metadata, much like parameters are. But, it seems like what Coffea wants is more like non-serialisable "context", very much like "behavior".

@lgray / @nsmith- could you possibly elucidate the properties that you'd need from a new context object?

Specifically, whether we need union or intersection properties for operations on two arrays? (behavior has union, properties have intersection). I assume you need union, because you're currently using behavior for this.

My current list is:

non serialised
union-merging rules

nsmith- · 2023-06-12T21:50:32Z

On our end we had not bothered to deal with name clashes if two NanoEvents arrays have different origins, so this didn't come up. I suppose we should figure out an appropriate hash for a NanoEventsFactory-generated array (should be straightforward) to ensure correct provenance. But in that case, I think we would actually want intersection because the mixin will not know which lineage to use for its components (unless they all get labeled somehow, which sounds like a can of worms best left unopened.)

jpivarski added the feature New feature or request label Mar 31, 2022

TomNicholas mentioned this issue Aug 10, 2022

Awkward array backend? pydata/xarray#4285

Open

jpivarski added the pr-next-release Required for the next release label Oct 31, 2022

jpivarski self-assigned this Nov 10, 2022

agoose77 removed the pr-next-release Required for the next release label Nov 15, 2022

jpivarski mentioned this issue Jan 31, 2023

Starter ideas for using xarray and Awkward Array together Cloud-Drift/ragged-array-idioms#4

Open

jpivarski mentioned this issue May 23, 2023

Support for units #2468

Open

lgray mentioned this issue Jun 2, 2023

Use awkward.Array.transient_attrs to store self-references when that functionality is available. CoffeaTeam/coffea#827

Closed

grst mentioned this issue Jul 7, 2023

Fix for awkward 2.3 scverse/anndata#1040

Merged

5 tasks

HealthyPear mentioned this issue Jul 28, 2023

Awkward table #2597

Closed

lgray mentioned this issue Aug 16, 2023

Dask-awkward arrays are not copyable when Coffea has been imported CoffeaTeam/coffea#878

Closed

agoose77 mentioned this issue Oct 13, 2023

feat: add .attrs to highlevel objects #2757

Merged

agoose77 mentioned this issue Nov 2, 2023

feat: add with_field_attr #2794

Closed

agoose77 closed this as completed in #2757 Nov 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

xarray-style "attrs", global and per-record field #1391

xarray-style "attrs", global and per-record field #1391

jpivarski commented Mar 31, 2022

agoose77 commented Nov 12, 2022 •

edited

Loading

agoose77 commented Nov 15, 2022

agoose77 commented Feb 1, 2023 •

edited

Loading

jpivarski commented Jun 2, 2023

agoose77 commented Jun 11, 2023 •

edited

Loading

nsmith- commented Jun 12, 2023 •

edited

Loading

xarray-style "attrs", global and per-record field #1391

xarray-style "attrs", global and per-record field #1391

Comments

jpivarski commented Mar 31, 2022

Description of new feature

agoose77 commented Nov 12, 2022 • edited Loading

agoose77 commented Nov 15, 2022

agoose77 commented Feb 1, 2023 • edited Loading

jpivarski commented Jun 2, 2023

agoose77 commented Jun 11, 2023 • edited Loading

nsmith- commented Jun 12, 2023 • edited Loading

agoose77 commented Nov 12, 2022 •

edited

Loading

agoose77 commented Feb 1, 2023 •

edited

Loading

agoose77 commented Jun 11, 2023 •

edited

Loading

nsmith- commented Jun 12, 2023 •

edited

Loading