xarray-style "attrs", global and per-record field #1391
Comments
I was thinking about this, and the per-field attributes could be implemented using Saying this, in general, I prefer the idea of a dedicated per-field metadata object rather than piggybacking an existing mechanism. If we redesigned awkward from scratch, I'd be tempted to do the same for the Maybe the name of this mechanism is
Jim and I discussed this - we'll move this to another release in favour of more pressing issues. This will require some thought.
We spoke about this today, and in general,
Although we can still use I believe this would motivate replacing Finally, we likely also want a mechanism for resolving parameters within a dimension. This
As discussed in CoffeaTeam/coffea#824, there's a use-case for wanting a local storage on
That use-case shares desiderata (1) and (2) with the So how about adding both
dask-awkward should someday implement Cc: @lgray, @nsmith-, @martindurant, @douglasdavis, @agoose77
I was starting to work on this, and concluded that the merging rules for @lgray / @nsmith- could you possibly elucidate the properties that you'd need from a new context object? Specifically, whether we need My current list is:
On our end we had not bothered to deal with name clashes if two NanoEvents arrays have different origins, so this didn't come up. I suppose we should figure out an appropriate hash for a NanoEventsFactory-generated array (should be straightforward) to ensure correct provenance. But in that case, I think we would actually want intersection because the mixin will not know which lineage to use for its components (unless they all get labeled somehow, which sounds like a can of worms best left unopened).
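One possible reading of the "intersection" rule mentioned above, sketched with plain Python dicts: when metadata from several arrays is combined, keep only the entries that every input agrees on. The function name `intersect_attrs` and the keys `dataset` and `units` are hypothetical and are not part of Awkward Array or coffea.

```python
def intersect_attrs(*attr_dicts):
    """Keep only the key/value pairs that every input dict agrees on."""
    if not attr_dicts:
        return {}
    first, *rest = attr_dicts
    return {
        key: value
        for key, value in first.items()
        # A key survives only if it is present with an equal value everywhere.
        if all(key in other and other[key] == value for other in rest)
    }


# Two hypothetical metadata dicts from arrays with different origins.
a = {"dataset": "A", "units": "GeV"}
b = {"dataset": "B", "units": "GeV"}

merged = intersect_attrs(a, b)  # the conflicting "dataset" entry is dropped
```

Under this rule, conflicting provenance is discarded rather than resolved, so no choice between lineages has to be made.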
Description of new feature
@philippemiron is converting data from NetCDF4 files into Awkward Arrays, and one of the features we lack is a place to put attributes. These can be descriptions, units, meanings of flags, etc., and a single field can have more than one of these (e.g. so that "units" are programmatically accessible). In general, any JSON-encodable data.
This applies at two levels: globally for a whole array, in such a way that all derived arrays pass on the attributes, and per-record field. The per-record field attributes should only be passed on if the meaning of the field is not changed, and should not be counted as part of the Content node's type. Also, these should be read and written to files as metadata wherever possible.
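As a rough illustration of the two levels, here is a sketch using plain Python dicts and the standard `json` module. The attribute keys, field names, and file name are hypothetical, chosen only to resemble NetCDF-style metadata; nothing here is an Awkward Array API.

```python
import json

# Hypothetical global attributes: they describe the whole array.
global_attrs = {"title": "drifter trajectories", "source": "example.nc"}

# Hypothetical per-record-field attributes: one dict per field name.
field_attrs = {
    "temperature": {"units": "degC", "long_name": "sea surface temperature"},
    "quality": {
        "flag_values": [0, 1, 2],
        "flag_meanings": ["good", "suspect", "bad"],
    },
}

# Both levels are JSON-encodable, so they can round-trip through file metadata.
encoded = json.dumps({"global": global_attrs, "fields": field_attrs})
decoded = json.loads(encoded)

# "units" remains programmatically accessible after the round trip.
units = decoded["fields"]["temperature"]["units"]
```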
Why not use parameter `__doc__`?
This would work as a per-record field attribute that is passed on whenever the meaning of the field is not changed, and doesn't count as part of the Content node's type. However, it has to be a string, since this is what goes into the Python `__doc__` property (and therefore IPython and Jupyter help). The attributes have to be general JSON-encodable metadata.
Also, this only encodes per-record field attributes, not global attributes.
Why not use a new parameter?
It would only be accessible through idioms like `array.layout.parameters["attrs"]`. This is high-level data analyst information, and analysts shouldn't have to go through `layout`, which is mid-level, for library developers. Also, such an idiom would fail if the layout ever gets buried in another Content node (i.e. the user has to be aware of layouts and how they change, which is not a high-level view!). For example, rearranging records in a RecordArray nests the RecordArray within an IndexedArray for performance reasons; so after certain kinds of slices, `array.layout.parameters["attrs"]` becomes `array.layout.content.parameters["attrs"]`, probably unexpectedly.
Why not use behavior?
It's an odd thing to do, but `behavior` is passed from an ak.Array to any new ak.Array derived from it. Any keys in `behavior` that aren't recognized are ignored. However, this would only encode global attributes, not per-record field attributes, and `behavior` (which typically contains function objects and class objects) is not serialized when writing files, or filled when reading files.
What instead?
This is really two new features:
1. Per-record field attributes: something like the `__doc__` parameter, but it doesn't have to be a string and is exposed at high-level in a different way (one that allows non-strings). Parameters are already serializable.
2. Global attributes: something like `behavior`, but serializable.
They should have names like `attrs`, following xarray's convention, but the per-record field `attrs` should be distinguishable somehow from the global `attrs`.
Perhaps both of these should be dicts (JSON objects) and when we're looking at one field, the two dicts are overlaid, with the per-field keys taking precedence over the global keys? That sounds natural and would minimize the use of names, but it sounds like just the sort of thing that would break something in the future.
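The overlay idea can be sketched with ordinary dict unpacking. This is only an illustration of the proposed precedence rule, not an existing Awkward Array feature; `effective_attrs` and the example keys are hypothetical names.

```python
def effective_attrs(global_attrs, field_attrs):
    """Overlay per-field attrs on global attrs; per-field keys take precedence."""
    # Later unpacking wins, so field_attrs shadows any matching global key.
    return {**global_attrs, **field_attrs}


# Hypothetical global and per-field attribute dicts.
global_attrs = {"source": "example.nc", "units": "unknown"}
field_attrs = {"units": "degC"}

merged = effective_attrs(global_attrs, field_attrs)
```

In this sketch the global `"units"` value is silently hidden by the per-field one, which is one way an overlay could surprise a user who expected to read the global value.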
What does xarray do?
xarray never has this issue because DataArray and Dataset are different types and can never be confused. In Awkward, an ak.Array is an ak.Array is an ak.Array, regardless of whether it's an array of lists of records, an array of just the records, or an array of one of the fields of those records. The global attrs can propagate down when extracting the array of records from the array of lists of records, but when you get to the array of one of the fields of those records, there's a conflict. They should probably be named differently, since "per-field attributes" and "global attributes" are different concepts, but which one gets the exalted name of "`attrs`"? And what would the other one be called?